[00:57:04] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[01:16:34] RECOVERY - MariaDB Slave Lag: s5 on db2045 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
[01:55:34] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[02:36:34] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.46 seconds
[03:11:34] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[03:11:34] e was received
[03:11:44] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:12:35] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.012 second response time
[03:13:34] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[03:20:34] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:21:24] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[03:24:34] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 749.99 seconds
[03:30:05] PROBLEM - Check whether ferm is active by checking the default input chain on scb1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[03:31:04] RECOVERY - Check whether ferm is active by checking the default input chain on scb1002 is OK: OK ferm input default policy is set
[03:35:15] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:35:15] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.test],File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:36:44] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:53:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 167.74 seconds
[04:01:44] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:14] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:05:14] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:05:15] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:06:14] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.858 second response time
[04:07:55] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[04:22:14] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[04:25:14] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received
[04:26:05] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[04:26:05] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[04:26:05] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[04:26:05] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[04:30:14] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[04:30:16] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[04:54:44] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[04:54:44] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[04:56:54] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[04:56:54] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[04:59:45] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[04:59:45] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[04:59:54] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received
[05:00:44] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[05:23:14] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[05:23:14] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[05:27:15] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[05:27:15] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[06:15:04] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[06:15:04] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[06:27:15] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[06:27:15] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[06:41:25] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[06:41:25] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[06:43:44] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[06:43:44] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[06:45:24] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[06:45:24] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received
[06:46:15] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy
[06:46:24] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[06:46:34] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[06:46:34] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[06:50:35] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Could not fetch url http://10.64.16.21:1970/api: Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=bibtex (Caused by ProtocolError(Connection aborted., BadStatusLine(,))): /api (Scrapes sample page) is CRITICAL: Could not fetch url http://1
[06:50:35] : Generic connection error: HTTPConnectionPool(host=u10.64.16.21, port=1970): Max retries exceeded with url: /api?search=http%3A%2F%2Fexample.comformat=mediawiki (Caused by ProtocolError(Connection aborted., BadStatusLine(,)))
[06:52:29] looking ^
[07:01:33] !log mobrovac@tin Started restart [electron-render/deploy@8dd5f13]: electron stuck - T174916
[07:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:41] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[07:09:27] !log mobrovac@tin Started restart [zotero/translators@a0c41c3]: Zotero eating up memory
[07:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:24] this was a trifecta: electron hanging, zotero eating up mem and trending edits losing its offset yet again
[07:13:52] got to love such sunday mornings
[07:17:04] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[09:28:37] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T180854#3772732 (10Qgil) >>! In T180854#3771558, @Tgr wrote: > SSO task is T124691, should probably be a blocker. Blocker for the pilot or for {T180853}? > That ce...
[12:08:43] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth - https://phabricator.wikimedia.org/T180903#3772847 (10Linedwell)
[12:09:04] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth - https://phabricator.wikimedia.org/T180903#3772859 (10Linedwell)
[12:45:49] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Arsog1985" to "Sigma'am" on Central Auth : supervision needed - https://phabricator.wikimedia.org/T180903#3772878 (10Framawiki)
[13:29:07] could someone reset the topic?
[13:32:06] thanks!
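The icinga-wm alert lines throughout this log all follow one fixed shape: a timestamp, a PROBLEM/RECOVERY transition, a check name, a host, a state, and the raw check output. A minimal sketch of a parser for that shape (the regex and field names here are my own illustration, not anything Wikimedia actually uses):

```python
import re

# Hypothetical parser for lines like:
# [HH:MM:SS] PROBLEM|RECOVERY - <check> on <host> is <state>: <output>
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is "      # non-greedy: stop at first " on <host> is "
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN): "
    r"(?P<output>.*)"
)

def parse_alert(line):
    """Return a dict of alert fields, or None for non-alert chatter."""
    m = ALERT_RE.match(line)
    return m.groupdict() if m else None

line = ("[03:24:34] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: "
        "CRITICAL slave_sql_lag Replication lag: 749.99 seconds")
alert = parse_alert(line)
```

The non-greedy `check` group matters because check names themselves contain colons and the word "is" (e.g. "MariaDB Slave Lag: s1", "wikidata.org dispatch lag is higher than 300s").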
[13:34:57] 10Operations, 10Wikimedia-Site-requests: {{NUMBEROFARTICLES}} is too low in din.wikipedia.org - https://phabricator.wikimedia.org/T180905#3772918 (10Amire80)
[13:36:20] you're welcome :).
[14:59:22] (03PS8) 10Paladox: javascript: Remove the npm package [puppet] - 10https://gerrit.wikimedia.org/r/386889
[14:59:35] (03Abandoned) 10Paladox: javascript: Remove the npm package [puppet] - 10https://gerrit.wikimedia.org/r/386889 (owner: 10Paladox)
[15:01:28] (03CR) 10Gehel: [C: 04-1] Gerrit: Fix up logstash configuation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox)
[15:02:07] (03CR) 10Paladox: Gerrit: Fix up logstash configuation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox)
[15:04:00] (03CR) 10Paladox: [C: 031] Gerrit: Fix up logstash configuation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox)
[15:05:18] gehel it's done purposely it seems. I want to add support for setting socket in gerrit with java code but it won't be in 2.13 and may not be in 2.14.
[15:05:51] i am just trying to think how we can wrap async around the socket one in java.
[15:06:16] though for now we have to do it that way until we get it into gerrit's core.
[15:11:43] paladox: the code you link checks the existence of the config file and bails if it cannot create the log directory. Looking at that code, I don't see any reason why the log files themselves should already exist
[15:11:55] it checks for
[15:12:04] log4j.configuration
[15:12:08] in system properties
[15:12:14] if it exists, it doesn't execute the code
[15:12:42] gehel https://github.com/GerritCodeReview/gerrit/blob/09786353f76b778a76a61a092adf60a41fbc3cfd/java/com/google/gerrit/server/util/SystemLog.java#L54
[15:12:52] LOG4J_CONFIGURATION = "log4j.configuration";
[15:13:32] ah here
[15:13:32] https://github.com/GerritCodeReview/gerrit/blob/09786353f76b778a76a61a092adf60a41fbc3cfd/java/com/google/gerrit/server/util/SystemLog.java#L79
[15:13:37] ah = and
[15:15:51] 10Operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#3773043 (10Nemo_bis) +CC per https://lists.wikimedia.org/pipermail/wikitech-l/2017-November/089137.html
[15:16:01] paladox: I have to leave (again), but can you try and make sure gerrit actually fails to log if the log files are non-existing? I'll probably be back for a bit later today...
[15:16:19] ok
[15:16:32] gehel i've confirmed that gerrit fails to log if the file does not exist
[15:17:23] How does it fail?
[15:17:36] it shows in /var/log/syslog that the file does not exist
[15:17:45] we had the wrong path specified for gc_log
[15:17:47] apparently
[15:18:16] so it was trying /var/log/gerrit/gc_log
[15:18:46] i fixed that now. but it was logging to syslog saying that path did not exist
[15:19:54] um
[15:20:06] never mind, it seems to create the file but not the directory if needed
[15:20:11] sorry for spam
[15:20:59] (03PS16) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324)
[15:54:07] (03Draft1) 10Paladox: planet: Update template / css / item look [puppet] - 10https://gerrit.wikimedia.org/r/389498
[15:54:11] (03Draft2) 10Paladox: planet: Update template / css / item look [puppet] - 10https://gerrit.wikimedia.org/r/389498
[15:54:14] (03Draft3) 10Paladox: planet: Update template / css / item look [puppet] - 10https://gerrit.wikimedia.org/r/389498
[15:54:18] (03PS4) 10Paladox: planet: Update template / css / item look [puppet] - 10https://gerrit.wikimedia.org/r/389498
[17:45:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[17:47:24] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[17:54:25] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:56:04] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:56:44] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[18:06:27] the 5xx spike is ores related, commented in https://phabricator.wikimedia.org/T179712#3773092
[18:15:02] Zayo port on cr2-eqiad down, but no related downtime announced afaics. ---^
[18:15:09] Cc: XioNoX
[18:30:27] (03Draft2) 10Jayprakash12345: Enable wgNamespacesWithSubpages for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392312
[18:30:54] (03PS3) 10Jayprakash12345: Enable wgNamespacesWithSubpages for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392312 (https://phabricator.wikimedia.org/T180913)
[18:31:07] elukey: not an issue, we have plenty of capacity. I will open a ticket if it's not solved by the time I get to my laptop in a few hours
[18:32:43] XioNoX: sure sure, I just wanted to ping you and get your opinion, thanks! :)
[18:56:15] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1952 bytes in 0.090 second response time
[19:00:15] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[19:38:06] (03PS1) 10Marostegui: db-eqiad.php: Load 0 for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392315
[19:39:36] (03CR) 10Jcrespo: [C: 04-1] "Load 9 doesn't depool a server." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392315 (owner: 10Marostegui)
[19:40:05] (03CR) 10Jcrespo: [C: 04-1] "I meant load 0" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392315 (owner: 10Marostegui)
[19:40:27] (03PS2) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392315
[19:40:51] jynus: I didn't realise it had replication broken, that is why I set 0
[19:42:57] (03PS1) 10Jcrespo: mariadb: Depool db1100, pool db1071 instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392316
[19:44:06] what do you think of https://gerrit.wikimedia.org/r/#/c/392316/1/wmf-config/db-eqiad.php ?
[19:44:29] I didn't choose db1071 in case you wanted it for testing during the week
[19:44:39] I don't mind if you prefer your patch or my patch
[19:44:58] I would set weight 0 though to db1071
[19:45:04] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received
[19:45:06] if you want to push your patch
[19:45:16] that would be my only comment about it
[19:45:25] with load 0 it will still be pinged every time
[19:45:33] that is how shitty the load balancer is
[19:45:38] \o/
[19:45:51] then, whatever patch you prefer I don't mind
[19:45:55] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy
[19:46:59] also probably how bad wikidata code is
[19:47:07] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1100, pool db1071 instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392316 (owner: 10Jcrespo)
[19:47:19] (03Abandoned) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392315 (owner: 10Marostegui)
[19:47:23] (03CR) 10jenkins-bot: mariadb: Depool db1100, pool db1071 instead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392316 (owner: 10Jcrespo)
[19:50:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 49s)
[19:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:33] I do not know why we have a software load balancer
[19:51:34] Interesting, this server crashed already: https://gerrit.wikimedia.org/r/#/c/378193/ so probably a rebuild is a good idea
[19:51:45] if a server goes trivially down
[19:52:00] the server keeps being queried
[19:52:09] yeah, that is pretty terrible :(
[19:52:42] and the only thing keeping the server up
[19:52:46] is the query killer
[19:56:15] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1948 bytes in 0.117 second response time
[19:57:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[20:10:44] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[20:21:57] (03PS1) 10Krinkle: noc: Link to Grafana instead of Ganglia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392317
[20:22:07] (03CR) 10Krinkle: [C: 032] noc: Link to Grafana instead of Ganglia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392317 (owner: 10Krinkle)
[20:23:20] (03Merged) 10jenkins-bot: noc: Link to Grafana instead of Ganglia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392317 (owner: 10Krinkle)
[20:23:30] (03CR) 10jenkins-bot: noc: Link to Grafana instead of Ganglia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392317 (owner: 10Krinkle)
[20:28:18] !log krinkle@tin Synchronized docroot/noc/index.html: noc: Link to Grafana (duration: 00m 49s)
[20:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:38] Krinkle,
[20:41:41] A database query error has occurred. This may indicate a bug in the software. [WhHsMApAAD0AAFkrelgAAAAA] 2017-11-19 20:41:16: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError"
[20:42:45] TabbyCat: on noc ?
[20:42:54] on production
[20:43:24] I was also puzzled that a MediaWiki error should occur on noc.
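The depool discussion above turns on a subtlety of weighted replica selection: setting a replica's load to 0 in db-eqiad.php stops it from being chosen for queries, but the host stays in the pool and is still contacted, whereas only removing its entry depools it entirely. A toy weighted-choice illustration of that distinction (hypothetical, not MediaWiki's actual LoadBalancer code):

```python
import random

def pick_replica(weights, rng=random.random):
    """Pick a replica host by weight. A host with weight 0 is never picked,
    but it is still iterated over here, just as a real balancer would still
    open connections to it and check its lag."""
    total = sum(weights.values())
    point = rng() * total
    for host, weight in weights.items():
        point -= weight
        if point < 0:
            return host
    return None

# Weight 0 keeps db1100 in the pool with no query traffic;
# deleting its entry is what takes it out of the pool entirely.
pool = {"db1100": 0, "db1071": 100, "db1067": 100}
picks = {pick_replica(pool) for _ in range(1000)}
```

This is why the conversation distinguishes "Load 0 for db1100" from "Depool db1100": the weight-0 host still gets pinged on every selection pass.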
[20:44:01] filed https://phabricator.wikimedia.org/T180919
[20:44:20] a query too slow
[20:44:59] I can't repro visiting your url
[20:45:12] (it correctly prints the user contributions of the maintenance script)
[20:45:31] special:log not special:contribs
[20:45:41] oh fun
[20:45:43] /wiki/Special:Log/Maintenance_script
[20:45:47] is slow / buggy
[20:46:10] (oh, nevermind, I was testing the referrer URL previously)
[21:04:55] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 504 (expecting: 200)
[21:06:04] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[21:15:23] (03CR) 10Zoranzoki21: [C: 031] "Looks good to me, but someone else must approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392312 (https://phabricator.wikimedia.org/T180913) (owner: 10Jayprakash12345)
[21:19:22] (03PS5) 10Paladox: planet: Update template / css / item look [puppet] - 10https://gerrit.wikimedia.org/r/389498
[21:23:10] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T180854#3773265 (10Tgr) Blocker for production, I mean.
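The flapping mobileapps alerts in this log are the standard endpoint-check pattern: fetch a known test URL and compare the HTTP status against an expected value, flagging anything else (here, intermittent 504s where 200 was expected). A minimal sketch of just that comparison and message formatting, mimicking the wording of the alerts (hypothetical, not the real service-checker code):

```python
def check_endpoint(description, status, expected=200):
    """Return a (state, message) pair in the style of the alerts above."""
    if status == expected:
        return ("OK", f"Test {description} succeeded")
    return ("CRITICAL",
            f"Test {description} returned the unexpected status {status} "
            f"(expecting: {expected})")

state, msg = check_endpoint(
    "retrieve lead section of en.wp Altrincham page via mobile-sections-lead",
    504)
```

A single recovered probe then flips the service back to OK, which is exactly the PROBLEM/RECOVERY flapping visible above.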
[21:57:55] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 504 (expecting: 200)
[21:59:04] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[21:59:48] (03CR) 10MarcoAurelio: [C: 031] Enable wgNamespacesWithSubpages for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392312 (https://phabricator.wikimedia.org/T180913) (owner: 10Jayprakash12345)
[22:05:16] 10Operations, 10Analytics, 10Research, 10Traffic, and 2 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3773297 (10Tgr)
[22:06:57] 10Operations, 10Analytics, 10Research, 10Traffic, and 2 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3773313 (10Tgr)
[22:17:24] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 504 (expecting: 200)
[22:18:24] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[22:22:49] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3773316 (10Paladox)
[22:40:57] (03PS6) 10Paladox: planet: Improve look and configuation updates [puppet] - 10https://gerrit.wikimedia.org/r/389498 (https://phabricator.wikimedia.org/T180498)
[22:46:44] 10Operations, 10Developer-Relations: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3773320 (10Qgil)
[23:20:09] !log removed 2FA for Ask21 T180889
[23:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:18] T180889: Disable 2FA for Ask21 - https://phabricator.wikimedia.org/T180889
[23:51:37] 10Operations, 10I18n: Publish full fallback sequence for generic families (sans, serif) in SVG font rendering - https://phabricator.wikimedia.org/T180923#3773351 (10Arthur2e5)
[23:54:20] 10Operations, 10Analytics, 10Research, 10Traffic, and 4 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3773364 (10gh87)
[23:55:47] 10Operations, 10I18n: Publish full fallback sequence for generic families (sans, serif) in SVG font rendering - https://phabricator.wikimedia.org/T180923#3773365 (10Arthur2e5)
[23:58:34] PROBLEM - Long running screen/tmux on graphite1001 is CRITICAL: CRIT: Long running SCREEN process. (PID: 36516, 1734464s 1728000s).