Add data loader for HF oasst1 by grgau · Pull Request #2951 · LAION-AI/Open-Assistant
In the PR that introduced RM for the dataset entry class, I forgot
that with RM we have multiple answers per question, e.g. [Q1, (A11, A12)].
I had introduced questions and answers as flat list types, so a question
could not be connected to its answers:
with questions=[Q1, Q2] and answers=[A11, A12, A21, A22] there was
no way to figure out that A11, A12 belong to Q1 and A21, A22 to
Q2. So I introduced a list[list[str]] type for the answers, so that
we can connect questions and answers by index: with questions=[Q1, Q2] and
answers=[[A11, A12], [A21, A22]], answers[0] belongs to
questions[0]. Note that this is backwards compatible, since answers
is a union type of list[str] | list[list[str]]. Also added tests for
this.
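To illustrate the index-based pairing described above, here is a minimal sketch. It assumes a simplified stand-in for the dataset entry class: SimpleEntry and iter_pairs are hypothetical names, not the actual code in this PR. It only shows how a union-typed answers field can be normalized so that the i-th answers belong to the i-th question.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass
class SimpleEntry:
    """Hypothetical, simplified stand-in for the dataset entry class."""

    questions: list[str]
    # Either a flat list (one answer per question) or one list of
    # candidate answers per question (the RM case).
    answers: Union[list[str], list[list[str]]]

    def iter_pairs(self) -> Iterator[tuple[str, list[str]]]:
        """Yield (question, candidate_answers), matched by index."""
        if len(self.questions) != len(self.answers):
            raise ValueError("questions and answers must have the same length")
        for question, answer in zip(self.questions, self.answers):
            # Wrap a flat answer in a list so both shapes look the same to callers.
            candidates = [answer] if isinstance(answer, str) else list(answer)
            yield question, candidates


# RM-style entry: several answers per question.
rm_entry = SimpleEntry(questions=["Q1", "Q2"], answers=[["A11", "A12"], ["A21", "A22"]])
# Old-style entry: one answer per question, still accepted.
flat_entry = SimpleEntry(questions=["Q1", "Q2"], answers=["A1", "A2"])

rm_pairs = list(rm_entry.iter_pairs())
flat_pairs = list(flat_entry.iter_pairs())
```

Normalizing at read time like this is one way to keep old flat entries usable; the real class may handle the two shapes differently.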
Ran python check_dataset_appearances.py -d webgpt --cache_dir .cache --mode rm
and found one entry with an empty question:
DatasetEntry(questions=[''],
answers=[['Lebensraum is a German geopolitical concept that means "living space." The term was originally used to support colonialism, and was later adapted by Nazi leader Adolf Hitler to support his quest for German expansion to the east . German geographer and ethnographer Friedrich Ratzel first published an essay called "Der Lebensraum" ("The Living Space") in 1901, in which he posited that all people, animals, and plants need to expand their living space in order to survive . According to Ratzel, species that successfully adapted to one location would spread naturally to others . Hitler believed that Germany required Lebensraum in order to survive, and this conviction that this living space could be gained only in the east and, specifically, from Russia, shaped his policy after his take-over of power in Germany in 1933 . The Nazi Generalplan Ost policy (\'Master Plan for the East\') was based on the tenets of Lebensraum . It stipulated that Germany required a Lebensraum necessary for its survival and that most of the indigenous populations of Central and Eastern Europe would have to be removed permanently (either through mass deportation to Siberia, extermination, or enslavement) .', 'There are several ways to unblock blocked websites. One way is to use a good web-based proxy server . Another way is to type in the URL of the blocked site you want to access in the address bar, and then press Go or Enter . The web content will be sent to the proxy server where it can then be viewed from your device . This may make browsing a bit slower, but you should still be able to access any of your favorite websites . Another way to unblock blocked websites is to use a VPN (Virtual Private Network) . A VPN can be used to access region-restricted websites, shield your web browsing activities on public WiFi networks, and more .']],
context=None,
lang=None,
length=None,
quality=None,
humor=None,
creativity=None
)
So this was the result:
'Found the following occurances in TRAIN webgpt:'
{re.compile('^[\\s\\n]*$'): ['']}
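The pattern in the output above matches strings that are empty or contain only whitespace, which is why the empty question was flagged. Here is a minimal sketch of such a check. find_blank is a hypothetical helper, since the actual logic of check_dataset_appearances.py is not shown here.

```python
import re

# Same pattern as in the output above: empty or whitespace-only strings.
BLANK = re.compile(r"^[\s\n]*$")


def find_blank(texts: list[str]) -> list[str]:
    """Return every text that is empty or whitespace-only (hypothetical helper)."""
    return [t for t in texts if BLANK.match(t)]


blank_hits = find_blank(["", "  \n", "What is Lebensraum?"])
```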