[00:32:24] (PS3) TerraCodes: Update InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[00:33:41] (PS4) TerraCodes: Remove overlapping userrights [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[00:43:29] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:54:59] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:31:09] Operations, Ops-Access-Requests: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3521778 (Addshore)
[02:31:28] Operations, Ops-Access-Requests, User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3521792 (Addshore)
[02:37:32] Operations, Ops-Access-Requests, Release-Engineering-Team, User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3521809 (Addshore)
[02:45:10] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:58:23] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3521821 (Legoktm) Yes please. I made addshore download and install jenkins-job-builder assuming he could deploy chang...
[03:25:55] (CR) Greg Grossmeier: [C: 1] Make daniel a deployer [puppet] - https://gerrit.wikimedia.org/r/371661 (https://phabricator.wikimedia.org/T173230) (owner: Reedy)
[03:28:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 688.23 seconds
[03:29:36] (PS1) Greg Grossmeier: Add addshore to contint-admins [puppet] - https://gerrit.wikimedia.org/r/371663 (https://phabricator.wikimedia.org/T173233)
[03:30:35] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Patch-For-Review, User-Addshore: Requesting access to contint-admins for addshore - https://phabricator.wikimedia.org/T173233#3521846 (greg) (obvious +1 from me)
[03:33:19] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:33:39] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[04:00:09] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[04:01:49] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:13:15] (CR) Alex Monk: [C: 1] Make daniel a deployer [puppet] - https://gerrit.wikimedia.org/r/371661 (https://phabricator.wikimedia.org/T173230) (owner: Reedy)
[04:25:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 92.33 seconds
[05:20:42] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 306 bytes in 0.001 second response time
[05:24:09] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:25:51] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.002 second response time
[05:27:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:38:20] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:40:40] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:42:20] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:43:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:59:29] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:02:40] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[06:04:30] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4888 bytes in 2.344 second response time
[06:28:09] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:29:09] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.015 second response time
[07:14:19] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[07:14:59] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4888 bytes in 0.022 second response time
[08:18:39] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[08:20:39] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4888 bytes in 19.818 second response time
[09:15:10] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[09:18:59] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4888 bytes in 0.036 second response time
[09:48:59] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:57:29] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[09:58:09] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4888 bytes in 0.024 second response time
[10:04:30] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[10:05:19] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4888 bytes in 0.029 second response time
[10:33:31] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3521978 (Marostegui) @daniel @hoo and myself got together to talk about this yesterday and looks...
[10:55:30] (CR) Mark Bergsma: "This is a good effort, but I feel this needs more work. Since this is now merged, how should I provide feedback?" [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/355082 (owner: Giuseppe Lavagetto)
[11:52:00] Operations, DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3521999 (Marostegui) I will try to run pt-table-checksum on the most important tables on s3 next week as db1015 is quite low on disk space already :-( ``` root@db1015:~# df -hT /srv Filesystem...
[12:08:40] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[12:14:10] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused
[12:36:50] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[12:45:29] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[12:51:19] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 4888 bytes in 0.038 second response time
[12:59:59] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[14:46:35] Operations, Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3522241 (LilyOfTheWest)
[14:56:39] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[15:00:49] (PS1) Reedy: No one cares about AFTv5 [mediawiki-config] - https://gerrit.wikimedia.org/r/371711
[16:16:48] (CR) Aklapper: [C: 1] Phabricator: Override the frog token's label [puppet] - https://gerrit.wikimedia.org/r/371660 (https://phabricator.wikimedia.org/T173208) (owner: Greg Grossmeier)
[16:34:18] PROBLEM - Disk space on graphite1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:36:19] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures.
Failed resources (up to 3 shown): Service[carbon-cache@c],Service[statsite@8127],Package[tzdata],Exec[wikidev_ensure_members] [17:03:48] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [17:04:48] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:42:18] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:43:08] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page v [17:43:08] timed out before a response was received [17:43:18] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [17:43:48] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:43:58] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:44:08] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [17:44:09] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main 
page via mobile-sections) timed out before a response was received [17:44:18] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:44:18] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:44:28] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None [17:44:38] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [17:45:18] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:45:18] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:45:28] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [17:45:38] PROBLEM - restbase endpoints health on cerium is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:45:38] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:45:38] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: 
/{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:45:39] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy [17:45:48] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:45:48] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [17:45:48] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:58] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:46:08] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [17:46:09] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [17:46:09] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [17:46:09] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:46:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [17:46:18] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [17:46:18] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [17:46:18] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [17:46:28] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [17:46:38] 
RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [17:46:49] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:46:58] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [17:46:58] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [17:46:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [17:47:08] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:47:09] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:47:28] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:47:38] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [17:47:58] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [17:47:59] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy [17:47:59] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: 
[0]/encyclopediaTitle = None [17:47:59] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [17:48:08] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [17:48:19] (03PS3) 10Reedy: phpcs for refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369808 [17:48:22] (03CR) 10Reedy: [C: 032] phpcs for refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369808 (owner: 10Reedy) [17:48:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [17:48:38] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [17:48:39] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [17:48:48] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [17:48:58] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [17:48:58] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy [17:49:28] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:49:52] (03Merged) 10jenkins-bot: phpcs for refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369808 (owner: 10Reedy) [17:49:59] (03PS2) 10Reedy: No one cares about AFTv5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371711 [17:50:02] (03CR) 10Reedy: [C: 032] No one cares about AFTv5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371711 (owner: 10Reedy) [17:50:05] (03CR) 10jenkins-bot: phpcs for refresh-dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369808 (owner: 10Reedy) [17:50:18] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from 
Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:50:28] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [17:51:19] !log reedy@tin Synchronized refresh-dblist: phpcs (duration: 00m 48s) [17:51:31] (03Merged) 10jenkins-bot: No one cares about AFTv5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371711 (owner: 10Reedy) [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:58] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:51:58] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:51:58] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:52:08] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:52:18] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None [17:52:28] 
PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:52:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received [17:52:28] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:52:28] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [17:52:28] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:52:28] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (ex [17:52:28] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received [17:52:29] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:52:31] (03CR) 10jenkins-bot: No one cares about AFTv5 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/371711 (owner: 10Reedy) [17:52:46] !log reedy@tin Synchronized wmf-config/db-codfw.php: fix comment (duration: 00m 47s) [17:52:48] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/random/{format} (Random title redirect) timed out before a response was received: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None [17:52:49] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:58] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received [17:52:58] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:53:18] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [17:53:18] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [17:53:28] PROBLEM - restbase 
endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None [17:53:28] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received [17:53:29] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (ex [17:53:38] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received [17:53:38] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [17:53:38] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [17:53:38] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected 
body: [0]/encyclopediaTitle = None [17:53:48] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [17:53:54] !log reedy@tin Synchronized wmf-config/db-eqiad.php: fix comment (duration: 00m 47s) [17:53:58] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [17:54:18] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy [17:54:18] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy [17:54:19] PROBLEM - restbase endpoints health on xenon is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:54:19] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [17:54:19] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [17:54:28] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [17:54:28] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [17:54:28] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [17:54:29] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:54:29] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:54:38] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [17:54:48] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:54:48] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:54:58] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (G [17:54:58] phoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:54:58] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [17:54:58] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) 
timed out before a response was received [17:55:08] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [17:55:14] (03PS2) 10Reedy: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 [17:55:19] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [17:55:28] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [17:55:29] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:55:29] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:55:38] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [17:55:38] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (G [17:55:38] phoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:55:38] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: 
[0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (G [17:55:38] phoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:55:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [17:55:49] PROBLEM - restbase endpoints health on cerium is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:55:58] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [17:55:58] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [17:56:08] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [17:56:28] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [17:56:38] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [17:56:38] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:56:38] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [17:56:48] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [17:56:49] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [17:56:50] (03PS3) 10Reedy: Update mediawiki-codesniffer to 
0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 [17:56:58] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [17:57:08] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [17:57:18] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:57:28] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (G [17:57:28] is there an issue going on? [17:57:28] phoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:57:28] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [17:57:29] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [17:57:38] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [17:57:38] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [17:57:48] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:57:58] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: 
/{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [17:58:11] I've got some reports of normal web api giving 503's [17:58:18] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [17:58:19] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None [17:58:28] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [17:58:28] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received [17:58:29] coming to the hacker room [17:58:38] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [17:58:38] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [17:58:38] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [17:58:39] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [17:58:39] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None [17:58:58] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [17:58:58] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [17:58:58] RECOVERY - 
Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [17:58:58] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [17:58:59] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [17:59:18] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [17:59:22] hmm [17:59:28] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [17:59:28] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:59:28] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [17:59:28] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [17:59:38] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [17:59:38] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [17:59:38] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [17:59:38] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [18:01:08] Leon reports web api working again [18:01:24] (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [18:01:43] bblack: paravoid ^ [18:03:48] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [18:04:26] (03PS4) 10Reedy: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 [18:04:39] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [18:05:48] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1956 bytes in 5.847 second response time [18:06:48] (03CR) 
10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [18:08:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [18:11:28] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:11:44] bblack: paravoid nvm, but we had similar symptoms to previous issues: restbase + mcs flapping, then musickanimal noted the api.php was 503ing (Twinkle users were complaining to him) [18:13:24] 10Operations, 10Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3522686 (10Legoktm) p:05Triage>03Unbreak! this is being investigated. Currently looking at db1082 being slow. [18:13:48] 10Operations, 10DBA, 10Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3522690 (10Legoktm) [18:13:51] (03PS5) 10Reedy: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 [18:14:48] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:15:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:16:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:17:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:17:38] (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [18:19:48] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1952 bytes in 3.045 second response time [18:20:33] (03PS6) 10Reedy: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/367465 [18:24:21] (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [18:29:48] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1927 bytes in 4.428 second response time [18:36:48] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1947 bytes in 1.432 second response time [18:41:48] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1926 bytes in 0.440 second response time [18:46:54] !log bounce carbon and uwsgi on graphite1003 [18:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [19:01:50] !log bounce pdfrender on scb1001 and scb1003 - T159922 [19:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:08] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922 [19:02:19] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.006 second response time [19:02:28] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [19:05:41] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3522796 (10fgiunchedi) This just happened again, any thoughts on what I wrote in T15... 
[19:06:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [19:09:31] (03PS1) 10Greg Grossmeier: Gerrit: Max batch limit = 11 [puppet] - 10https://gerrit.wikimedia.org/r/371739 [19:10:33] (03CR) 10Greg Grossmeier: "bd808 hit this limit in a patch chain today. He only needed one more and 11 is as good a number as 10." [puppet] - 10https://gerrit.wikimedia.org/r/371739 (owner: 10Greg Grossmeier) [19:10:36] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-Site-requests, 10Mobile: Wikimania 2017 site does not automatically redirect to mobile site, when opening from a mobile device - https://phabricator.wikimedia.org/T120943#1865328 (10MarcoAurelio) With Wikimania 2017 ending soon, is it worth the effo... [19:11:18] "this patch chain goes to 11." "why not amend 10?" "this patch chain goes to 11." [19:13:50] turn up the patch chain to 11 [19:16:08] it's obviously one better [19:17:07] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-Site-requests, 10Mobile: Wikimania 2017 site does not automatically redirect to mobile site, when opening from a mobile device - https://phabricator.wikimedia.org/T120943#1865328 (10Krenair) WFM [19:18:34] 10Operations, 10Wikimedia-General-or-Unknown, 10Mobile: Wikimania 2017 site does not automatically redirect to mobile site, when opening from a mobile device - https://phabricator.wikimedia.org/T120943#3522852 (10Krenair) 05Open>03Resolved a:03Dzahn I think it was fixed by https://gerrit.wikimedia.org/... [19:21:19] 10Operations, 10Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3522858 (10fgiunchedi) >>! In T173056#3521172, @Multichill wrote: > @fgiunchedi what do you think are the risks? Number of incoming images maybe? Haven't seen any i... 
[19:24:10] 10Operations, 10Thumbor, 10User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3522861 (10fgiunchedi) [19:24:35] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3522863 (10fgiunchedi) [19:25:20] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.2 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/370907 (https://phabricator.wikimedia.org/T161719) (owner: 10Gilles) [19:29:41] (03PS7) 10Reedy: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 [19:31:58] (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [19:35:56] !log upload python-thumbor-wikimedia 1.2 - T161719 [19:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:15] T161719: Add STL support (with 3d2png) to Thumbor - https://phabricator.wikimedia.org/T161719 [19:41:11] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3522921 (10fgiunchedi) @Gilles I can reproduce at will the test failure above on stretch, thoughts? [19:44:26] 10Operations, 10Thumbor, 10User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3522926 (10fgiunchedi) FTR this happened again last night (UTC), I'm currently working on having thumbor run on stretch in T170817 which will also bring a newer... [20:07:15] 10Operations, 10DBA, 10Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3522664 (10Marostegui) I don't see anything wrong with db1082 (or the s5 master). 
db1082 had some spikes but traffic is back to normal: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc... [20:13:33] 10Operations, 10DBA, 10Wikidata: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3522952 (10Marostegui) I am seeing one disk on db1063 (the master) with errors and predictive failure, but it has been on predictive failure since 15th Jul, and the raid is OPTIMAL status. As the tr... [20:19:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:20:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [20:21:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:34:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:36:18] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:36:58] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:52:38] 10Operations, 10media-storage, 10Patch-For-Review: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#3522999 (10brion) [21:20:57] (03PS8) 10Reedy: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 [21:23:12] (03CR) 10jerkins-bot: [V: 04-1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367465 (owner: 10Reedy) [22:45:41] !log restart elasticsearch on elastic1017 after setting md2 readahead to 256 to match md2 on 1032-152 [22:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:17] (03PS3) 10Bearloga: rename "r" module to "r_lang" [puppet] - 10https://gerrit.wikimedia.org/r/371075 
(owner: 10Gehel) [22:54:34] (03PS9) 10Bearloga: contint: profile, role, and packages for R language [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [22:56:40] (03PS4) 10Bearloga: rename "r" module to "r_lang" [puppet] - 10https://gerrit.wikimedia.org/r/371075 (owner: 10Gehel) [23:29:28] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:30:58] PROBLEM - Webrequests Varnishkafka log producer on cp4016 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [23:31:58] RECOVERY - Webrequests Varnishkafka log producer on cp4016 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf [23:56:59] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.92 seconds [23:57:48] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures