Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Doug Ewell-4
On July 27, Florian Rivoal wrote:
 
> However, RFC5646 Section 4.5, which defines canonicalization, only
> does so for language tags, not for language ranges. Presumably, the
> process is largely the same, with wildcards in the language subtag
> being preserved, and I suppose wildcards in other subtags would likely
> be dropped. But as it stands, that seems undefined.
 
I think you are on the right track by assuming that ranges are
canonicalized just like tags, with asterisks left alone.
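The assumption above can be sketched as follows. This is only an illustration, not something RFC 5646 defines: it covers just the extlang step of canonicalization, the `EXTLANG_PREFIX` table is a hypothetical hand-written subset of the registry, and `canonicalize_range` is an invented name. Subtags are lowercased because matching is case-insensitive.

```python
# Hypothetical subset of the registry's extlang records: extlang -> Prefix.
# For extlangs, the Preferred-Value is the same subtag used as a primary
# language subtag, so "zh-yue" canonicalizes to "yue".
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh", "arb": "ar"}


def canonicalize_range(language_range: str) -> str:
    """Apply the extlang canonicalization step to a range, leaving '*' alone."""
    subtags = language_range.lower().split("-")
    # Drop the prefix when the second subtag is an extlang registered under it.
    if len(subtags) >= 2 and EXTLANG_PREFIX.get(subtags[1]) == subtags[0]:
        subtags = subtags[1:]
    # Wildcards are never looked up or removed; they pass through untouched.
    return "-".join(subtags)


print(canonicalize_range("zh-yue-*-HK"))  # yue-*-hk
print(canonicalize_range("*-Hant"))       # *-hant
```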
 
It's not very likely that most LTRU participants will be eager to start
up a new IETF project to update 5646 for something like this. Best to go
with your assumption.
 
> Also, while giving recommendations about canonicalization for the
> purpose of filtering, it would seem useful to mention (and possibly to
> recommend) canonicalizing to the "extlang form". The definition of the
> extlang form itself (in  RFC5646 Section 4.5) mentions that it is
> useful for matching and selecting, but that information isn't relayed
> anywhere in RFC4647.
 
At the time these documents were written, there was a strong sentiment
around de-emphasizing extlangs in general. It's good to know that
there's a real-world use case for using them here. Again, it's unlikely
that people will want to rev 4647 for this.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 

_______________________________________________
Ltru mailing list
[hidden email]
https://www.ietf.org/mailman/listinfo/ltru

Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Florian Rivoal


> On Aug 28, 2019, at 2:47, Doug Ewell <[hidden email]> wrote:
>
> On July 27, Florian Rivoal wrote:
>
>> However, RFC5646 Section 4.5, which defines canonicalization, only
>> does so for language tags, not for language ranges. Presumably, the
>> process is largely the same, with wildcards in the language subtag
>> being preserved, and I suppose wildcards in other subtags would likely
>> be dropped. But as it stands, that seems undefined.
>
> I think you are on the right track by assuming that ranges are
> canonicalized just like tags, with asterisks left alone.

Thanks for confirming.

> It's not very likely that most LTRU participants will be eager to start
> up a new IETF project to update 5646 for something like this. Best to go
> with your assumption.
>
>> Also, while giving recommendations about canonicalization for the
>> purpose of filtering, it would seem useful to mention (and possibly to
>> recommend) canonicalizing to the "extlang form". The definition of the
>> extlang form itself (in  RFC5646 Section 4.5) mentions that it is
>> useful for matching and selecting, but that information isn't relayed
>> anywhere in RFC4647.
>
> At the time these documents were written, there was a strong sentiment
> around de-emphasizing extlangs in general. It's good to know that
> there's a real-world use case for using them here. Again, it's unlikely
> that people will want to rev 4647 for this.

The use case is CSS selectors, when writing rules for typography/styling in a document. On the one hand, the document is marked up to indicate which parts of it are in which language. On the other hand, the style sheet describes which parts of the document must be styled in which way, and can make that styling dependent on the language.

The need for normalization comes from the fact that stylesheet authoring and document authoring are not coordinated in the general case, so a stylesheet author generally cannot know whether a document will be marked up with, for example, zh-yue or yue. The stylesheet author is then faced with two options, both unattractive for different reasons:

* Use the deprecated tag: it's more likely to be found in existing documents due to being older. The first downside is that it doesn't always work. The second is that this slows down adoption of the newer preferred tag: document authors wanting to be compatible with existing stylesheets will keep using the deprecated one as well, and we get into a vicious cycle of everybody continuing to use the deprecated variant.

* Use both the deprecated and the preferred tag in the stylesheet's selector. This works, but it means that stylesheet authors need to be aware of, and manually replicate, the information in https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry. Asking people to manually do what software could do isn't great, as they tend not to, or to do it with bugs, or to not update when the registry updates, etc.

So it seems preferable, given that this correspondence is maintained in a neatly usable format, to have CSS renderers deal with the correspondence between deprecated and preferred tags by way of canonicalizing to the extlang form and doing the selector matching on that.

In the long run, both document authors and stylesheet authors should use the preferred tag without the extlang prefix, and the canonicalization to extlang form will be invisible to them. But even if some don't, everything works.
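A minimal sketch of the canonicalization to extlang form described above, assuming a renderer has already loaded the extlang records from the registry. The `EXTLANG_PREFIX` table here is a hypothetical hand-written subset, and `to_extlang_form` is an invented name rather than any existing API.

```python
# Hypothetical subset of the registry's extlang records: extlang -> Prefix.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh", "hak": "zh"}


def to_extlang_form(tag: str) -> str:
    """Prepend the registered prefix when the primary subtag is also an extlang."""
    subtags = tag.lower().split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])
    if prefix is not None:
        subtags = [prefix] + subtags
    return "-".join(subtags)


# Both spellings of Cantonese end up as the same internal form, so a
# stylesheet author can write either one in a selector.
print(to_extlang_form("yue"))          # zh-yue
print(to_extlang_form("zh-yue"))       # zh-yue
print(to_extlang_form("yue-Hant-HK"))  # zh-yue-hant-hk
```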

—Florian

Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Doug Ewell-4
Florian Rivoal wrote:
 
> The stylesheet author is then faced with two options, both
> unattractive for different reasons:
>
> * use the deprecated tag: [...]
>
> * Use both the deprecated and the preferred tag in the stylesheet's
> selector. [...]
 
Extlang subtags, and tags like "zh-yue" that use them, aren't
deprecated. Extlang subtags have a Preferred-Value, but they don't have
a Deprecated value. While this is a subtle distinction, "deprecated" is
a term of art in BCP 47. It might be better to use a more unwieldy term
like "non-preferred."
 
Section 3.1.7 says:
 
"For records of type 'extlang', the 'Preferred-Value' field appears
without a corresponding 'Deprecated' field.  An implementation MAY
ignore these preferred value mappings, although if it ignores the
mapping, it SHOULD do so consistently.  It SHOULD also treat the
'Preferred-Value' as equivalent to the mapped item.  For example, the
tags "zh-yue-Hant-HK" and "yue-Hant-HK" are semantically equivalent and
ought to be treated as if they were the same tag."
 
It does go on to say:
 
"The 'Preferred-Value' field in subtag records of type "extlang" also
contains an "extended language range".  This allows the subtag to be
deprecated in favor of either a single primary language subtag or a new
language-extlang sequence."
 
but this is confusing to me, as I'm not sure who does the "deprecating"
here.
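To make the distinction concrete, here is a rough sketch of how software might recognize the case Section 3.1.7 describes: an extlang record with a Preferred-Value but no Deprecated field. The sample record is written in the registry's field format for illustration only and its values should be checked against the live registry; `parse_record` and `is_non_preferred_extlang` are invented names, and folded continuation lines are not handled.

```python
# Illustrative record in the registry's "Field: value" format.
# Values are an assumption; verify against the live registry before use.
RECORD = """\
Type: extlang
Subtag: yue
Description: Yue Chinese
Added: 2009-07-29
Preferred-Value: yue
Prefix: zh
Macrolanguage: zh
"""


def parse_record(text: str) -> dict[str, list[str]]:
    """Parse one record into field name -> list of values (fields may repeat)."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition(":")
        fields.setdefault(name.strip(), []).append(value.strip())
    return fields


def is_non_preferred_extlang(fields: dict[str, list[str]]) -> bool:
    """True for an extlang with a Preferred-Value and no Deprecated field."""
    return (
        fields.get("Type") == ["extlang"]
        and "Preferred-Value" in fields
        and "Deprecated" not in fields
    )


fields = parse_record(RECORD)
print(is_non_preferred_extlang(fields))  # True
```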
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 


Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

John Cowan


> On Wed, Aug 28, 2019 at 12:52 PM Doug Ewell <[hidden email]> wrote:
>
> "The 'Preferred-Value' field in subtag records of type "extlang" also
> contains an "extended language range".  This allows the subtag to be
> deprecated in favor of either a single primary language subtag or a new
> language-extlang sequence."

I think this means that if SIL decides to change a primary language tag that is also an extlang tag, we can deprecate the old extlang tag as well.  Another scenario would be that they decide some language doesn't belong in its macrolanguage, and we need to deprecate the extlang tag while leaving the primary language tag alone.   Hopefully neither of these will ever happen.


John Cowan          http://vrici.lojban.org/~cowan        [hidden email]
    "Any legal document draws most of its meaning from context.  A telegram
    that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
    5-bit Baudot code plus appropriate headers) is as good a legal document
    as any, even sans digital signature." --me



Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Mark Davis ☕
Canonicalizing to the extlang form has a number of disadvantages, and I would recommend strongly against it. Don't have time to discuss now, will see about next week.

Mark



Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Florian Rivoal


> On Aug 29, 2019, at 15:19, Mark Davis ☕️ <[hidden email]> wrote:
>
> Canonicalizing to the extlang form has a number of disadvantages, and I would recommend strongly against it. Don't have time to discuss now, will see about next week.

Very interested in what you have to say about this.

In case that influences what you have to say, note that what I intend to do is not to store the canonicalized-to-extlang form anywhere. It would only be for internal processing: when performing an extended filtering operation, where it is unknown whether the ranges and tags are in extlang form or not, canonicalize both to extlang form and do the extended filtering operation on that.

That way, a zh-*-Hant selector would match both a zh-yue-Hant document element and a yue-Hant one.
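A sketch of that processing, assuming the extended filtering steps of RFC 4647 Section 3.3.2 and a hypothetical hand-written subset of the registry's extlang records. `to_extlang_form`, `extended_filter`, and `selector_matches` are invented names, and nothing here is stored: both sides are canonicalized only for the comparison.

```python
# Hypothetical subset of the registry's extlang records: extlang -> Prefix.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh"}


def to_extlang_form(tag_or_range: str) -> list[str]:
    """Lowercase, split, and prepend the prefix if the first subtag is an extlang."""
    subtags = tag_or_range.lower().split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])  # "*" is never in the table
    return ([prefix] if prefix else []) + subtags


def extended_filter(range_subtags: list[str], tag_subtags: list[str]) -> bool:
    """Extended filtering following the steps in RFC 4647 Section 3.3.2."""
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    ri, ti = 1, 1
    while ri < len(range_subtags):
        if range_subtags[ri] == "*":
            ri += 1                      # wildcard: skip it
        elif ti >= len(tag_subtags):
            return False                 # tag exhausted before the range
        elif range_subtags[ri] == tag_subtags[ti]:
            ri += 1
            ti += 1
        elif len(tag_subtags[ti]) == 1:
            return False                 # do not skip past a singleton
        else:
            ti += 1                      # skip an unmatched tag subtag
    return True


def selector_matches(language_range: str, tag: str) -> bool:
    """Canonicalize both sides to extlang form, then filter."""
    return extended_filter(to_extlang_form(language_range), to_extlang_form(tag))


print(selector_matches("zh-*-Hant", "zh-yue-Hant"))  # True
print(selector_matches("zh-*-Hant", "yue-Hant"))     # True
```

Without the canonicalization step, the second call would fail at the first comparison, since "yue" is not "zh".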

—Florian



Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

John Cowan


> On Thu, Aug 29, 2019 at 5:00 AM Florian Rivoal <[hidden email]> wrote:
>
> In case that influences what you have to say, note that what I intend to do is not to store the canonicalized-to-extlang form anywhere. It would only be for internal processing: when performing an extended filtering operation, where it is unknown whether the ranges and tags are in extlang form or not, canonicalize both to extlang form and do the extended filtering operation on that.

In that case you can equally canonicalize away from the extlang form as toward it.  I recommend that.


John Cowan          http://vrici.lojban.org/~cowan        [hidden email]
        "Not to know The Smiths is not to know K.X.U."  --K.X.U.



Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Richard Ishida
On 29/08/2019 14:44, John Cowan wrote:

> On Thu, Aug 29, 2019 at 5:00 AM Florian Rivoal <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>
>     In case that influences what you have to say, note that what I
>     intend to do is not to store the canonicalized-to-extlang form
>     anywhere. It would only be for internal processing: when performing
>     an extended filtering operation, where it is unknown whether the
>     ranges and tags are in extlang form or not, canonicalize both to
>     extlang form and do the extended filtering operation on that.
>
>
> In that case you can equally canonicalize away from the extlang form as
> toward it.  I recommend that.


... which is what I was thinking too.

ri


Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Florian Rivoal


> On Aug 29, 2019, at 22:44, John Cowan <[hidden email]> wrote:
>
>
>
> On Thu, Aug 29, 2019 at 5:00 AM Florian Rivoal <[hidden email]> wrote:
>
> In case that influences what you have to say, note that what I intend to do is not to store the canonicalized-to-extlang form anywhere. It would only be for internal processing: when performing an extended filtering operation, where it is unknown whether the ranges and tags are in extlang form or not, canonicalize both to extlang form and do the extended filtering operation on that.
>
> In that case you can equally canonicalize away from the extlang form as toward it.  I recommend that.

Can you?

Let's say you want to match (using extended filtering) the zh range against documents that may contain the zh-yue or yue tags (and possibly others such as zh-cmn, zh-hakka, zh, zh-HK…). This could be something a typesetter wants to do in order to use a particular font and set of line breaking rules for any chunk of Chinese (in the broad sense) text.

If we canonicalize to extlang form:
  zh -> zh
  zh-yue -> zh-yue
  yue -> zh-yue
Therefore, the zh range will match documents that contained either zh-yue or yue. This is what I want.

If we canonicalize away from extlang form:
  zh -> zh
  zh-yue -> yue
  yue -> yue
Therefore, the zh range will match neither the documents that contained zh-yue nor those that contained yue. This is not what I want, and is worse than not canonicalizing at all.

So it seems to me that no, we cannot canonicalize away from the extlang form and get the same results.

If the extended filtering operation did something smart with macrolanguages, then I wouldn't need canonicalization at all, but it doesn't, so I feel I need to canonicalize, and as described above, only canonicalization to extlang actually seems to help.
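The comparison above can be reproduced with a small sketch. It assumes a hypothetical two-entry subset of the extlang records, and it uses a plain prefix test for matching, which gives the same result as extended filtering for a single-subtag range such as zh. All function names are invented for the illustration.

```python
# Hypothetical subset of the registry's extlang records: extlang -> Prefix.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh"}


def toward_extlang(tag: str) -> str:
    """yue -> zh-yue; tags already in extlang form are unchanged."""
    subtags = tag.lower().split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])
    return "-".join(([prefix] if prefix else []) + subtags)


def away_from_extlang(tag: str) -> str:
    """zh-yue -> yue; tags without an extlang are unchanged."""
    subtags = tag.lower().split("-")
    if len(subtags) >= 2 and EXTLANG_PREFIX.get(subtags[1]) == subtags[0]:
        subtags = subtags[1:]
    return "-".join(subtags)


def matches(language_range: str, tag: str) -> bool:
    """Prefix match; equivalent to extended filtering for a one-subtag range."""
    return tag == language_range or tag.startswith(language_range + "-")


tags = ["zh", "zh-yue", "yue"]
toward = [matches("zh", toward_extlang(t)) for t in tags]
away = [matches("zh", away_from_extlang(t)) for t in tags]
untouched = [matches("zh", t) for t in tags]

print(toward)     # [True, True, True]
print(away)       # [True, False, False]
print(untouched)  # [True, True, False]
```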

Am I missing something?

—Florian


Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Florian Rivoal


Sorry for not including that in the previous message, but to give another example, if I want to use the no range to match any of: no, no-bok, no-nyn, nb, or nn, canonicalization to extlang form works, and canonicalization away from it doesn't.

—Florian

Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

John Cowan
Got it. Yes, with extended filtering, you need to canonicalize toward extlangs.  I thought you were talking about simple equality matching after canonicalization.

Note that almost all macrolanguages do not have extlang tags; the languages can be identified by the macrolanguage or by the individual language tag, but not both.  Indeed, only ar, kok, lv, ms, sw, uz, zh have extlang tags subordinated to them; sgn does too but is not a macrolanguage.





Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Florian Rivoal


> On Aug 30, 2019, at 11:17, John Cowan <[hidden email]> wrote:
>
> Got it. Yes, with extended filtering, you need to canonicalize toward extlangs.  I thought you were talking about simple equality matching after canonicalization.
>
> Note that almost all macrolanguages do not have extlang tags; the languages can be identified by the macrolanguage or by the individual language tag, but not both.  Indeed, only ar, kok, lv, ms, sw, uz, zh have extlang tags subordinated to them;

Ah, indeed, I had failed to notice that. That doesn't make what I am proposing to do wrong, but it does limit its usefulness to a subset of the macrolanguage families. So my Chinese example would work, but my Norwegian one wouldn't.

Is there a reason for the lack of extlang on these macrolanguages that don't have it? Inferring the equivalent information by inverting the "Preferred-Value" information doesn't sound hard, but if it wasn't done, presumably there's a reason.

> sgn does too but is not a macrolanguage.

It doesn't sound terribly useful to me to do the extlang canonicalization for the sake of sgn, but neither does it sound harmful.
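The limitation can be illustrated with a short sketch. It assumes a hypothetical subset of the extlang records in which, as noted above, nothing is registered under "no"; the function names are invented, and a plain prefix test stands in for extended filtering with a single-subtag range.

```python
# Hypothetical subset of the registry's extlang records: extlang -> Prefix.
# nb and nn are absent on purpose: per the discussion above, "no" is not
# among the prefixes that have extlang subtags.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh"}


def to_extlang_form(tag: str) -> str:
    """Prepend the registered prefix when the primary subtag is an extlang."""
    subtags = tag.lower().split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0])
    return "-".join(([prefix] if prefix else []) + subtags)


def range_matches(language_range: str, tag: str) -> bool:
    """Single-subtag range against a tag canonicalized to extlang form."""
    canonical = to_extlang_form(tag)
    return canonical == language_range or canonical.startswith(language_range + "-")


print(range_matches("zh", "yue"))  # True: yue becomes zh-yue
print(range_matches("no", "nb"))   # False: nb has no extlang form
```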

—Florian



Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

John Cowan
It was only done for macrolanguages that already had compound tags like zh-yue, plus a few others that seemed likely to be treated similarly, often because they share only a single written form between them.  Aymara is a macrolanguage with two individual languages, Central and Southern, but only Central has a written form and is an official language (of Bolivia).  If you don't care about the difference, you can use ay, or you can use ayc or ayr to distinguish, but nobody has to worry about abstracting over them.



On Thu, Aug 29, 2019 at 11:55 PM Florian Rivoal <[hidden email]> wrote:


On Aug 30, 2019, at 11:17, John Cowan <[hidden email]> wrote:

Got it. Yes, with extended filtering, you need to canonicalize toward extlangs.  I thought you were talking about simple equality matching after canonicalization.

Note that almost all macrolanguages do not have extlang tags; the languages can be identified by the macrolanguage or by the individual language tag, but not both.  Indeed, only ar, kok, lv, ms, sw, uz, zh have extlang tags subordinated to them;

Ah, indeed, I had failed to notice that. That doesn't make what I am proposing to do wrong, but it does limit its usefulness to a subset of the macrolanguage families. So my Chinese example would work, but my Norwegian one wouldn't.

Is there a reason for the lack of extlang on these macrolanguages that don't have it? Inferring the equivalent information by inverting the "Preferred-Value" information doesn't sound hard, but if it wasn't done, presumably there's a reason.

sgn does too but is not a macrolanguage.

It doesn't sound terribly useful to me to do the extlang canonicalization for the sake of sgn, but neither does it sound harmful.

—Florian


On Thu, Aug 29, 2019 at 9:14 PM Florian Rivoal <[hidden email]> wrote:


> On Aug 30, 2019, at 10:08, Florian Rivoal <[hidden email]> wrote:
>
>
>
>> On Aug 29, 2019, at 22:44, John Cowan <[hidden email]> wrote:
>>
>>
>>
>> On Thu, Aug 29, 2019 at 5:00 AM Florian Rivoal <[hidden email]> wrote:
>>
>> In case that influences what you have to say, note that what I intend to do is not to store the canonicalized-to-extlang form anywhere. It would only be for internal processing: when performing an extended filtering operation, where it is unknown whether the ranges and tags are in extlang form or not, canonicalize both to extlang form and do the extended filtering operation on that.
>>
>> In that case you can equally canonicalize away from the extlang form as toward it.  I recommend that.
>
> Can you?
>
> Let's say you want to match (using extended filtering) the zh range against documents that may contain the zh-yue or yue tags (and possibly others: zh-cmn, zh-hakka, zh, zh-HK…). This could be something a typesetter wants to do to use a particular font and set of line breaking rules for any chunk of Chinese (in the broad sense) text.
>
> If we canonicalize to extlang form:
>  zh -> zh
>  zh-yue -> zh-yue
>  yue -> zh-yue
> Therefore, the zh range will match documents that contained either zh-yue or yue. This is what I want.
>
> If we canonicalize away from extlang form:
>  zh -> zh
>  zh-yue -> yue
>  yue -> yue
> Therefore, the zh range will match neither documents that contained zh-yue nor yue. This is not what I want, and is worse than not canonicalizing at all.
>
> So it seems to me that no, we cannot canonicalize away from the extlang form and get the same results.
>
> If the extended filtering operation did something smart with macrolanguages, then I wouldn't need canonicalization at all, but it doesn't, so I feel I need to canonicalize, and as described above, only canonicalization to extlang actually seems to help.
>
> Am I missing something?
>
> —Florian

Sorry for not including that in the previous message, but to give another example, if I want to use the no range to match any of: no, no-bok, no-nyn, nb, or nn, canonicalization to extlang form works, and canonicalization away from it doesn't.

—Florian



Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Florian Rivoal


> On Aug 30, 2019, at 13:06, John Cowan <[hidden email]> wrote:
>
> It was only done for macrolanguages that already had compound tags like zh-yue, plus a few others that seemed likely to be treated similarly, often because they share only a single written form between them.  Aymara is a macrolanguage with two individual languages, Central and Southern, but only Central has a written form and is an official language (of Bolivia).  If you don't care about the difference, you can use ay, or you can use ayc or ayr to distinguish, but nobody has to worry about abstracting over them.

Not sure I get it.

nb is the preferred form of grandfathered+deprecated no-bok, nn of no-nyn, and no is marked as a macrolanguage. Why do they not have extlangs?

—Florian

Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

John Cowan
The written forms of Norwegian don't align well with particular spoken forms, unlike the national varieties of English.  In Norway there is no standard spoken language: you speak your own dialect and write either Nynorsk (10-15%) or Bokmål.  So the written form is unambiguously nb or nn, and the spoken form is unambiguously no.  When people tag a written form as no, they usually mean Bokmål, but that's not the kind of thing that extended filtering was meant to handle.

A young Russian worker travels to Norway, right after the Bolshevik Revolution. He enters a pub and finds all the occupants quarreling about something in a flaming temper. He asks one of them: “Does this mean the revolution is starting in Norway, too?” – “Oh no”, says a Norwegian; “we are just arguing about how to *spell* the word.”   (It's in fact _revolusjon_ in both.)

On Fri, Aug 30, 2019 at 12:31 AM Florian Rivoal <[hidden email]> wrote:


> On Aug 30, 2019, at 13:06, John Cowan <[hidden email]> wrote:
>
> It was only done for macrolanguages that already had compound tags like zh-yue, plus a few others that seemed likely to be treated similarly, often because they share only a single written form between them.  Aymara is a macrolanguage with two individual languages, Central and Southern, but only Central has a written form and is an official language (of Bolivia).  If you don't care about the difference, you can use ay, or you can use ayc or ayr to distinguish, but nobody has to worry about abstracting over them.

Not sure I get it.

nb is the preferred form of grandfathered+deprecated no-bok, nn of no-nyn, and no is marked as a macrolanguage. Why do they not have extlangs?

—Florian


Re: Mail regarding draft-ietf-ltru-4646bis and draft-ietf-ltru-matching

Doug Ewell-4
 Florian Rivoal wrote:
 
> nb is the preferred form of grandfathered+deprecated no-bok, nn of
> no-nyn, and no is marked as a macrolanguage. Why do they not have
> extlangs?
 
One possible reason, although it might not be "the" reason (I would
have to look into the mail archives to see if this was raised), is
that extlangs and their corresponding primary language subtags have to
be identical, and extlangs can't be two letters long, because then they
would be region subtags.
 
It might be worth noting that "no-bok" and "no-nyn" aren't real
extlangs, despite looking like them. They were created under the RFC
1766/3066 whole-tag registration model, back in 1995, before ISO 639
added code elements [nb] and [nn] in 2000. That's why they are
grandfathered (because the tags don't mean what breaking them apart into
subtags would indicate) and deprecated (because there are regular
subtags that are preferred).
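
As a rough illustration of why these tags are handled as whole strings, here is a minimal Python sketch. The two entries reflect the Preferred-Value fields of the "no-bok" and "no-nyn" registry records; the table and function names are made up for this example, and a real implementation would cover the full grandfathered list.

```python
# Whole-tag replacements for two grandfathered tags (illustrative subset).
GRANDFATHERED_PREFERRED = {
    "no-bok": "nb",
    "no-nyn": "nn",
}


def replace_grandfathered(tag: str) -> str:
    """Map a grandfathered tag to its preferred value, matching the whole tag.

    The lookup is done on the complete tag, before any splitting into
    subtags, because "bok" and "nyn" are not extlang subtags.
    """
    return GRANDFATHERED_PREFERRED.get(tag.lower(), tag)


print(replace_grandfathered("no-bok"))  # nb
print(replace_grandfathered("zh-yue"))  # zh-yue (not grandfathered, unchanged)
```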
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 
