Commit 7cc2071c authored by amrabdou

[issue, #32] Definition of a sensible JSON configuration with concrete examples

   - Added JSON Structure for Test data from #27
   - Added README to explain how the structure works
   - Used CSS_SELECTOR as part of the ways to find data
parent f9b21061
Pipeline #69867 passed with stages
in 18 minutes and 25 seconds
# JSON Configurations
There are two types of data to be processed:
1. Word files ending with `.docx`
2. FAQ websites:
   1. Websites with questions and answers as text on the same page
   2. Websites with questions as text on the same page, and a link to the answer on a separate page
### `.docx` JSON structure
```
{
  "type": "doc",                        # type: either "doc" or "url"
  "name": "htc_sync_manager_faq.docx",  # name of the file, including its extension
  "question": {
    "style": "normal",                  # paragraph style: "normal", "heading1", "heading2", etc.
    "format": "bold",                   # text format: "italic", "bold", "strikethrough", etc.
    "size": 12                          # font size
  }
}
```
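
The `#` annotations in the structure above are explanatory only: standard JSON has no comment syntax, so the actual config files leave them out. The following is a minimal sketch of how such a config could be loaded and checked, using only the Python standard library. The function name `load_doc_config` and the decision to treat all three `question` fields as required are assumptions made for illustration, not part of this repository.

```python
import json

# Keys taken from the structure documented above; treating all of them
# as required is an assumption made for this sketch.
REQUIRED_QUESTION_KEYS = {"style", "format", "size"}


def load_doc_config(text: str) -> dict:
    """Parse a `.docx` config and check that the documented fields are present."""
    config = json.loads(text)
    if config.get("type") != "doc":
        raise ValueError('expected "type": "doc"')
    if not str(config.get("name", "")).endswith(".docx"):
        raise ValueError('"name" should be a file name ending in .docx')
    missing = REQUIRED_QUESTION_KEYS - set(config.get("question", {}))
    if missing:
        raise ValueError(f"question is missing keys: {sorted(missing)}")
    return config


example = """
{
  "type": "doc",
  "name": "htc_sync_manager_faq.docx",
  "question": {"style": "normal", "format": "bold", "size": 12}
}
"""
doc_config = load_doc_config(example)
```

A check like this would flag a config that uses a different key name than the one documented here, for example `"type"` in place of `"format"` inside `question`.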
### Websites with Question-Answer on same page
```
{
  "type": "url",   # type: either "doc" or "url"
  "name": "https://www.berlin.de/sen/finanzen/steuern/informationen-fuer-steuerzahler-/faq-steuern/artikel.697552.php",  # URL of the website
  "question": {
    "type": "text",                      # "text" means the content is HTML on the same page
    "css_selector": "div.block a"        # CSS selector for the question; you can check it in the web console with document.querySelector(css_selector)
  },
  "answer": {
    "type": "text",                      # same as above
    "css_selector": "div.block .text"    # same as above
  }
}
```
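
With this structure, the two selectors each yield a list of matches. One plausible way to combine them is to pair questions and answers by position. The sketch below assumes the texts have already been extracted and that both selectors match in the same document order; the function name `pair_questions_answers` is hypothetical.

```python
def pair_questions_answers(questions: list[str], answers: list[str]) -> list[tuple[str, str]]:
    """Pair question and answer texts by position, trimming whitespace."""
    # A length mismatch usually means one selector matched extra elements,
    # so fail loudly instead of silently producing shifted pairs.
    if len(questions) != len(answers):
        raise ValueError(
            f"selector mismatch: {len(questions)} questions, {len(answers)} answers"
        )
    return [(q.strip(), a.strip()) for q, a in zip(questions, answers)]


# Placeholder texts, not taken from any of the configured websites.
pairs = pair_questions_answers([" Q1 ", "Q2"], ["A1", " A2\n"])
```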
### Websites with Question on the same page, answer on a separate page
```
{
  "type": "url",                           # same as above
  "name": "https://www.stvo.de/info/faq",  # same as above
  "question": {
    "type": "text",                        # same as above
    "css_selector": "table.category > tbody td.list-title > a"        # same as above
  },
  "answer": {
    "type": "href",                        # the link to the answer is in the href attribute of the element matched by href_css_selector
    "href_css_selector": "table.category > tbody td.list-title > a",  # selector for the link element; check it with document.querySelector(href_css_selector)
    "css_selector": "div.item-page > div:nth-child(5)"                # where to find the answer text on the linked page
  }
}
```
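
The difference between `"type": "text"` and `"type": "href"` comes down to which page the answer's `css_selector` is applied to. A minimal sketch of that decision, using `urllib.parse.urljoin` from the Python standard library to resolve the link against the site URL; the function name `answer_url` and the sample `href` values are made up for illustration.

```python
from urllib.parse import urljoin


def answer_url(config: dict, href: str = "") -> str:
    """Return the URL of the page on which the answer text should be looked up."""
    answer_type = config["answer"]["type"]
    if answer_type == "text":
        # The answer is on the same page as the question.
        return config["name"]
    if answer_type == "href":
        # href values are often relative, so resolve them against the site URL.
        return urljoin(config["name"], href)
    raise ValueError(f"unknown answer type: {answer_type!r}")


stvo_config = {
    "type": "url",
    "name": "https://www.stvo.de/info/faq",
    "answer": {"type": "href"},
}
# "/info/faq/some-answer" is a hypothetical link, not a real page path.
resolved = answer_url(stvo_config, "/info/faq/some-answer")
```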
### Notice
Of the 5 sites listed in `test-data/PSE 5 Websiten.txt`, the URL `https://de-de.facebook.com/business/faq/` has no config JSON:
Facebook serves its content dynamically rather than as static HTML, so the page cannot be parsed from the HTML alone.
\ No newline at end of file
{
"type": "doc",
"name": "forex_freiheit_krypto_faq.docx",
"question" : {
"style": "normal",
"format": "bold",
"size": 12
}
}
\ No newline at end of file
{
"type": "doc",
"name": "htc_sync_manager_faq.docx",
"question" : {
"style": "normal",
"format": "bold",
"size": 12
}
}
\ No newline at end of file
{
"type": "doc",
"name": "htc_sync_manager_faq.docx",
"question" : {
"style": "normal",
"format": "bold",
"size": 12
}
}
\ No newline at end of file
{
"type": "url",
"name": "https://www.berlin.de/sen/finanzen/steuern/informationen-fuer-steuerzahler-/faq-steuern/artikel.697552.php",
"question" : {
"type": "text",
"css_selector": "div.block a"
},
"answer": {
"type": "text",
"css_selector": "div.block .text"
}
}
\ No newline at end of file
{
"type": "url",
"name": "https://www.dge.de/wissenschaft/weitere-publikationen/faqs/?L=0",
"question" : {
"type": "text",
"css_selector": ".news-list-container .news-list-item h2"
},
"answer": {
"type": "href",
"href_css_selector": ".news-list-container .news-list-item h2 a",
"css_selector": "#c10287 > h3"
}
}
\ No newline at end of file
{
"type": "url",
"name": "https://www.kletterfabrik.koeln/faq.html",
"question" : {
"type": "text",
"css_selector": "div.toggler h3"
},
"answer": {
"type": "text",
"css_selector": "section.ce_accordion div.accordion > div"
}
}
\ No newline at end of file
{
"type": "url",
"name": "https://www.stvo.de/info/faq",
"question" : {
"type": "text",
"css_selector": "table.category > tbody td.list-title > a"
},
"answer": {
"type": "href",
"href_css_selector": "table.category > tbody td.list-title > a",
"css_selector": "div.item-page > div:nth-child(5)"
}
}
\ No newline at end of file