[00:09:14] Operations, Traffic, Wikimedia-General-or-Unknown, HTTPS: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3751867 (Mholloway) For the record, the current stable and beta versions of the official Wikipedia Android app do...
[00:48:09] !log Decommissioning Cassandra, restbase2006-b.codfw.wmnet (T179422)
[00:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:16] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[02:25:19] (Abandoned) Reedy: Disable inject recent changes on all client wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/383107 (https://phabricator.wikimedia.org/T171027) (owner: Reedy)
[02:30:48] Operations, Ops-Access-Requests, Discovery, Wikidata, and 3 others: allow wdqs-admins to pool / depool wdqs servers - https://phabricator.wikimedia.org/T172798#3752211 (Smalyshev) Open>Resolved `HOME=/root sudo depool` seems to be working.
[02:34:58] PROBLEM - puppet last run on es2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:48] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[03:04:39] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[03:04:58] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:29:38] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 900.50 seconds
[03:53:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 189.73 seconds
[06:11:38] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se
[06:11:38] out before a response was received: / (spec from root) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve featured article info for unsupported site (with aggregated=true)) timed out before
[06:11:38] ived: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[06:13:38] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received
[06:14:29] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[07:03:42] (PS3) Krinkle: StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183)
[07:04:38] (CR) jerkins-bot: [V: -1] StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[07:05:49] (PS4) Krinkle: StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183)
[08:21:45] (CR) Framawiki: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[08:22:54] (CR) jerkins-bot: [V: -1] Adjust throttle.php for dewiki workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[09:14:39] (PS3) Lokal Profil: Support prefixed dump types [dumps/dcat] - https://gerrit.wikimedia.org/r/390312 (https://phabricator.wikimedia.org/T163328)
[09:17:02] (CR) Lokal Profil: "There seems to be no ci (not even linting) on this repo. Not sure what the standard is normally with the ops repos but would probably not " [dumps/dcat] - https://gerrit.wikimedia.org/r/390312 (https://phabricator.wikimedia.org/T163328) (owner: Lokal Profil)
[09:19:45] (CR) Lokal Profil: "if structure looks good then I'll also add the truthy.nt files and add a companion patch to modules/snapshot/files/cron/dcatconfig.json in" [dumps/dcat] - https://gerrit.wikimedia.org/r/390312 (https://phabricator.wikimedia.org/T163328) (owner: Lokal Profil)
[09:23:58] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:43:28] PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time
[09:43:49] PROBLEM - WDQS HTTP on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time
[09:44:28] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:46:25] Data import in progress for wdqs2001, downtime probably expired. Cc SMalyshev
[09:53:58] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:16:38] (PS3) Zoranzoki21: Adjust throttle.php for dewiki workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[10:17:54] (CR) Zoranzoki21: [C: 1] StartProfiler.php: Add lots of documentation [mediawiki-config] - https://gerrit.wikimedia.org/r/390548 (https://phabricator.wikimedia.org/T180183) (owner: Krinkle)
[10:18:27] (CR) Zoranzoki21: "OK is now. CR: +1" [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[10:18:36] (CR) Zoranzoki21: [C: 1] Adjust throttle.php for dewiki workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[10:21:31] (CR) TerraCodes: [C: 1] "It was failing the test because of spacing. wow." [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[10:21:56] (PS4) TerraCodes: Adjust throttle.php for dewiki workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046)
[10:24:18] (CR) Zoranzoki21: [C: 1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[10:29:36] (CR) Zoranzoki21: [C: 1] "> It was failing the test because of spacing. wow." [mediawiki-config] - https://gerrit.wikimedia.org/r/390188 (https://phabricator.wikimedia.org/T180046) (owner: TerraCodes)
[10:36:38] PROBLEM - Host ms-fe1008 is DOWN: PING CRITICAL - Packet loss = 100%
[10:38:18] RECOVERY - Host ms-fe1008 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[10:47:32] Operations, Trending-Service, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban), and 4 others: Update trending-edits' node-rdkafka to v1.x - https://phabricator.wikimedia.org/T179786#3752406 (mobrovac) Thank you, @bearND for looking into it. >>! In T179786#3752001, @bearND wrote: >...
[13:52:35] !log Decommissioning Cassandra, restbase2006-c.codfw.wmnet (T179422)
[13:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:43] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[13:57:08] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o
[13:57:08] e was received
[13:57:58] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[15:32:13] (PS1) Ori.livneh: xenon: pass --mindwidth to flamegraph.pl [puppet] - https://gerrit.wikimedia.org/r/390645
[15:59:58] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 400 (expecting: 200)
[16:00:49] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[16:12:49] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3018 MB (3% inode=98%)
[16:42:09] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[16:43:08] RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[17:00:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 400 (expecting: 200)
[17:02:49] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[17:16:38] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:16:49] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:17:19] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:17:28] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5736 bytes in 0.002 second response time
[17:17:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:17:39] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[17:21:58] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received
[17:22:18] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.024 second response time
[17:22:48] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy
[17:29:28] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[18:07:38] PROBLEM - Disk space on install1002 is CRITICAL: DISK CRITICAL - free space: / 3018 MB (3% inode=98%)
[20:24:53] (PS12) TerraCodes: Remove overlapping userrights [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[21:12:28] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:13:19] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 75405 bytes in 0.170 second response time
[21:14:59] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused
[21:17:59] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.006 second response time
[22:27:29] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:27:30] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:27:38] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve featured article info for unsupported site (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[22:28:18] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused
[22:28:28] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[22:28:28] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5736 bytes in 0.003 second response time
[22:28:38] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[22:29:18] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 4.587 second response time
[22:29:58] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se
[22:29:58] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[22:31:09] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[22:31:49] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[22:32:08] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[22:49:02] Hello
[23:07:11] (CR) Zoranzoki21: [C: 1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: TerraCodes)
[23:31:04] (CR) Zoranzoki21: [C: 1] search.wikimedia.org: Clean up result returning logic [mediawiki-config] - https://gerrit.wikimedia.org/r/390357 (owner: Chad)
[23:31:39] (CR) Zoranzoki21: [C: 1] WIP: search.wikimedia.org: Stop supporting non-Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/390358 (owner: Chad)
[23:32:20] Operations, Traffic, Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3752773 (Tgr) This continues to be a pain point for WP0 abuse, and probably a major accident waiting to happen in general (imagine failing to honor DMCA takedown time limits, or some kind...
[23:33:00] (CR) Zoranzoki21: [C: 1] Enable per-filter profiling on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/390153 (https://phabricator.wikimedia.org/T179323) (owner: Dmaza)
[23:39:08] HI! Please abandon patch https://gerrit.wikimedia.org/r/#/c/356046/ because it should be fixed on translatewiki
[23:40:57] or you could constructively post that as a comment on the patchset, with further information on how to do it properly so the committer knows for in the future
[23:42:48] ok
[23:53:57] Zoranzoki21 Hi, en and qqq changes should be done in the repo :). Not in translatewiki.
[23:54:19] Oh ok