[02:23:55] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 06m 49s)
[02:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 12 02:30:11 UTC 2017 (duration 6m 16s)
[02:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:48] PROBLEM - SSH on ms-be1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:41:38] RECOVERY - SSH on ms-be1019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[03:12:08] PROBLEM - Check systemd state on mw1293 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:13:18] PROBLEM - puppet last run on mw1293 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:14:08] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[03:14:59] RECOVERY - Check systemd state on mw1293 is OK: OK - running: The system is fully operational
[03:32:58] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:33:48] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1021 is OK: OK ferm input default policy is set
[04:10:28] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1705.40 Read Requests/Sec=3134.80 Write Requests/Sec=19.00 KBytes Read/Sec=30866.00 KBytes_Written/Sec=2907.20
[04:20:28] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=0.00 Write Requests/Sec=1.20 KBytes Read/Sec=0.00 KBytes_Written/Sec=28.40
[04:44:28] RECOVERY - MegaRAID on heze is OK: OK: optimal, 1 logical, 12 physical
[05:46:58] PROBLEM - swift-container-replicator on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:58] PROBLEM - swift-account-server on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:58] PROBLEM - swift-account-replicator on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:58] PROBLEM - swift-account-auditor on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:48] RECOVERY - swift-container-replicator on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[05:47:48] RECOVERY - swift-account-server on ms-be2012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[05:47:49] RECOVERY - swift-account-auditor on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[05:47:49] RECOVERY - swift-account-replicator on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[05:53:55] !log Stop MySQL dbstore2001 for testing - T165033
[05:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:05] T165033: dbstore2001 takes 3 hours to start MySQL after a crash - https://phabricator.wikimedia.org/T165033
[05:58:49] (PS1) Marostegui: db-codfw.php: Repool db2064, depool db2063 [mediawiki-config] - https://gerrit.wikimedia.org/r/353492 (https://phabricator.wikimedia.org/T162611)
[06:02:02] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2064, depool db2063 [mediawiki-config] - https://gerrit.wikimedia.org/r/353492 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:03:02] (Merged) jenkins-bot: db-codfw.php: Repool db2064, depool db2063 [mediawiki-config] - https://gerrit.wikimedia.org/r/353492 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:03:15] (CR) jenkins-bot: db-codfw.php: Repool db2064, depool db2063 [mediawiki-config] - https://gerrit.wikimedia.org/r/353492 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:05:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2064, depool db2063 - T162611 (duration: 00m 39s)
[06:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:47] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:05:57] !log Deploy alter table on s2 (revision table) db2063 - https://phabricator.wikimedia.org/T162611
[06:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:45] (PS2) Muehlenhoff: Drop cache/LVS NFS override [puppet] - https://gerrit.wikimedia.org/r/352748 (https://phabricator.wikimedia.org/T106477)
[06:37:50] Operations, HHVM, Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3257615 (MoritzMuehlenhoff) Status update, this has now been narrowed down by Reedy to a single reproducer from the phpunit tests and one of the HHVM developers said he'd look into a fix soon.
[06:38:35] (CR) Muehlenhoff: [C: 2] Drop cache/LVS NFS override [puppet] - https://gerrit.wikimedia.org/r/352748 (https://phabricator.wikimedia.org/T106477) (owner: Muehlenhoff)
[06:40:58] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token]
[06:53:24] So, who got security clearance and is a netop, so this can be speedily responded to? https://phabricator.wikimedia.org/T165103#3257633
[06:54:59] Operations, HHVM, Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3257659 (hashar) @Reedy investigation at https://github.com/facebook/hhvm/issues/7779 shows an issue within XmlReader. That seems very close to T156923#2992912 which has hit us with HHVM 3.12.11....
[06:55:39] Operations, Beta-Cluster-Infrastructure: Mails through deployment-mx SPF & DKIM fails - https://phabricator.wikimedia.org/T87338#3257661 (Nemo_bis) Is this still current?
[06:57:24] Operations, Mail, Wikimedia-Mailing-lists, Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3257674 (Nemo_bis) Another example (this one is probably a compromised mailbox contact list, rather than archive scraping): P5430
[06:58:23] moritzm: good morning :)
[06:58:39] Operations, Pybal, Traffic, netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3257675 (elukey) Two snippets of the nginx error log set to debug, captured from the `close http upstream connection` event onwards: With Connection: close ``` 2017/0...
[06:59:00] moritzm: looks like the MediaWiki tests exploding yesterday is very similar to an issue we had with a patch to HHVM 3.11 back in February
[06:59:37] moritzm: https://phabricator.wikimedia.org/T156923#2992912 is a trace almost identical to Reedy's debug session yesterday
[06:59:52] and the fix was to back out the bzip2-segfault-sweep.patch introduced in 3.12.11+dfsg-1, and that fixed it
[07:03:07] Operations, HHVM, Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3257676 (MoritzMuehlenhoff) That might be related, but I'm not fully convinced. The patch we dropped from 3.12.11 was a backport from trunk. It might be that this feature was broken to begin with a...
[07:04:21] hashar: I just followed up on the task, I'll add some additional information to the upstream task, but it's not unlikely that we're seeing the same crash behaviour due to an unrelated bug/memory corruption earlier on
[07:04:46] I'll add that information, maybe it helps narrow it down earlier.
[07:07:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:09:25] <_joe_> uhm
[07:09:59] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:10:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:10:33] moritzm: and the next thing is CI instances install whatever hhvm is in jessie-wikimedia/main . That is the 3.18 that segfaults now
[07:11:02] so I would need a way to install 3.12, maybe via apt pinning, or roll back 3.18 from jessie-wikimedia/main
[07:16:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:16:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:17:43] maybe Varnish again?
[07:18:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:20:31] it seems that the 503s had cp3043 with 'int'
[07:20:33] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&var-server=cp3043&var-datasource=esams%20prometheus%2Fops
[07:20:43] the fetch failed matches with the spike
[07:21:15] ema: --^ (keep reporting, not sure if you guys still need moar data or not for the task, let me know :)
[07:22:24] (brb)
[07:24:30] hashar: I'll catch up with you in about an hour ok? currently tracking down the history of the earlier bzip2 crash for the HHVM devs
[07:24:52] I can add a temporary archive section with 3.12 for CI
[07:25:25] but I'm very optimistic that this bug will be fixed in new 3.18 packages next week as well
[07:41:55] moritzm: I am optimistic as well. The trouble is CI magically upgrades on a daily basis :D
[07:42:43] sure, we can add 3.12 archive section as an interim, no problem
[07:44:41] (CR) Giuseppe Lavagetto: [C: 1] ClusterShell: fix set of list options [software/cumin] - https://gerrit.wikimedia.org/r/352796 (https://phabricator.wikimedia.org/T164824) (owner: Volans)
[07:50:43] (CR) Giuseppe Lavagetto: [C: 1] PuppetDB backend: forbid resource's parameters regex [software/cumin] - https://gerrit.wikimedia.org/r/346302 (https://phabricator.wikimedia.org/T162151) (owner: Volans)
[07:52:34] (CR) Giuseppe Lavagetto: [C: 1] PuppetDB backend: consistently use InvalidQueryError [software/cumin] - https://gerrit.wikimedia.org/r/346301 (https://phabricator.wikimedia.org/T162151) (owner: Volans)
[07:58:04] (CR) Giuseppe Lavagetto: "I'd prefer to just let the user specify no modifier, but it's ok either way" [software/cumin] - https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) (owner: Volans)
[08:08:22] (CR) Giuseppe Lavagetto: [C: -1] "Code is correct, I'd prefer it to be a little more formal." (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827) (owner: Volans)
[08:14:04] elukey: thanks :)
[08:15:17] (CR) Giuseppe Lavagetto: [C: 1] Transports: move BaseWorker helper methods to module functions [software/cumin] - https://gerrit.wikimedia.org/r/352841 (https://phabricator.wikimedia.org/T164838) (owner: Volans)
[08:22:40] Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3257809 (MoritzMuehlenhoff) Bikeshedding time! ppa is obviously a poor name and just meant as an example. Possible names I can think of: - component (fairly generic, but somewhat my favourite). Also pretty c...
[08:25:08] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:27:50] (CR) Alexandros Kosiaris: "Apart from a question inline about the IRC bot, rest LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi)
[08:28:27] (CR) Giuseppe Lavagetto: [C: -1] "Small suggestion, but LGTM" (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/352842 (https://phabricator.wikimedia.org/T164838) (owner: Volans)
[08:34:26] (CR) Filippo Giunchedi: [C: 1] Move swift auth URL to ProductionServices [mediawiki-config] - https://gerrit.wikimedia.org/r/353173 (owner: Aaron Schulz)
[08:36:34] (CR) Alexandros Kosiaris: [C: 1] "after some discussion in IRC I am +1ing this" [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi)
[08:37:12] (CR) Giuseppe Lavagetto: [C: 1] Transports: use Command class for commands [software/cumin] - https://gerrit.wikimedia.org/r/352843 (https://phabricator.wikimedia.org/T164838) (owner: Volans)
[08:39:25] (PS7) Ayounsi: Various LibreNMS improvements [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911)
[08:41:36] (CR) Ayounsi: [C: 2] Various LibreNMS improvements [puppet] - https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: Ayounsi)
[08:42:15] (CR) Giuseppe Lavagetto: [C: 1] Transports: allow to specify a timeout per Command [software/cumin] - https://gerrit.wikimedia.org/r/352844 (https://phabricator.wikimedia.org/T164838) (owner: Volans)
[08:43:17] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3257888 (elukey) I just repooled mw2256, now pybal is sending health checks. I didn't find any trace of recurrence of the error, let's keep this task open for a little longer.
[08:45:16] (CR) Giuseppe Lavagetto: [C: 1] "Great, thanks!" [software/cumin] - https://gerrit.wikimedia.org/r/352845 (https://phabricator.wikimedia.org/T164838) (owner: Volans)
[08:48:25] Operations, Labs, Labs-Infrastructure: Ferm rules for labstore NFS hosts - https://phabricator.wikimedia.org/T165136#3257892 (MoritzMuehlenhoff)
[08:49:14] (CR) Giuseppe Lavagetto: [C: 1] ClusterShell: allow to specify exit codes per Command (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/352892 (https://phabricator.wikimedia.org/T164833) (owner: Volans)
[08:50:45] Operations, Labs, Labs-Infrastructure: Ferm rules for labstore NFS hosts - https://phabricator.wikimedia.org/T165136#3257909 (MoritzMuehlenhoff) And labstore::misc will also need to configure a static rpcdmountd port (already done for labstore::secondary)
[08:54:08] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:54:45] (PS1) Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136)
[08:54:50] (CR) Alexandros Kosiaris: [C: 2] backup: remove duplicate 'standard'-include [puppet] - https://gerrit.wikimedia.org/r/353362 (owner: Dzahn)
[08:54:55] (PS2) Alexandros Kosiaris: backup: remove duplicate 'standard'-include [puppet] - https://gerrit.wikimedia.org/r/353362 (owner: Dzahn)
[08:55:10] (CR) Alexandros Kosiaris: [C: 2] backup::offsite: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353363 (owner: Dzahn)
[08:55:13] (CR) Alexandros Kosiaris: [V: 2 C: 2] backup: remove duplicate 'standard'-include [puppet] - https://gerrit.wikimedia.org/r/353362 (owner: Dzahn)
[08:55:25] (PS2) Alexandros Kosiaris: backup::offsite: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353363 (owner: Dzahn)
[08:55:28] (CR) Alexandros Kosiaris: [V: 2 C: 2] backup::offsite: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353363 (owner: Dzahn)
[08:55:46] (CR) jerkins-bot: [V: -1] Add initial class for ferm rules shared by all labstore hosts [puppet] - https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: Muehlenhoff)
[08:58:44] !log Rename semantic tables before dropping them on wikitech hosts (silver and labtestweb2001) - T164887
[08:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:53] T164887: Drop Semantic Database tables from wikitech wikis - https://phabricator.wikimedia.org/T164887
[09:01:15] (PS1) Alexandros Kosiaris: varnish: Rename planet1001 director to planet [puppet] - https://gerrit.wikimedia.org/r/353509
[09:02:01] !log move planet2001 to ganeti nodegroup row_A
[09:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:35] Operations, Pybal, Traffic, netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3257925 (elukey) Opened a task to upstream to confirm: https://trac.nginx.org/nginx/ticket/1270#ticket
[09:04:04] Operations, Wikimedia-Logstash, Discovery-Search (Current work): logstash mapping mixing up field types - https://phabricator.wikimedia.org/T165137#3257926 (Gehel)
[09:08:06] (PS2) Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136)
[09:08:30] (CR) Filippo Giunchedi: [C: 1] syslog::centralserver: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353124 (owner: Dzahn)
[09:08:41] (CR) Filippo Giunchedi: [C: 2] syslog::centralserver: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353124 (owner: Dzahn)
[09:09:18] jouncebot: next
[09:09:18] In 70 hour(s) and 50 minute(s): ores_classification clean up party (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T0800)
[09:14:27] Operations, Wikimedia-Logstash, Discovery-Search (Current work): logstash mapping mixing up field types - https://phabricator.wikimedia.org/T165137#3257949 (Gehel)
[09:16:48] RECOVERY - mediawiki-installation DSH group on mw2146 is OK: OK
[09:18:28] RECOVERY - mediawiki-installation DSH group on mw2256 is OK: OK
[09:21:06] Operations, ops-codfw: mw2098 failed to come up after reboot - https://phabricator.wikimedia.org/T164959#3257956 (MoritzMuehlenhoff) I agree with decommissioning mw2098, we have sufficient spare capacity in the codfw app server cluster to not need to worry about broken OOW hardware.
[09:22:32] Operations, ops-codfw: mw2098 failed to come up after reboot - https://phabricator.wikimedia.org/T164959#3252410 (elukey) I agree too, let's decom it!
[09:27:00] (PS1) Alexandros Kosiaris: Renumber planet2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/353511
[09:29:20] (PS1) Ayounsi: Add logstash-syslog-tcp LVS service [puppet] - https://gerrit.wikimedia.org/r/353513 (https://phabricator.wikimedia.org/T151971)
[09:33:45] (CR) Filippo Giunchedi: [C: 1] "LGTM, just a minor OCD-nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/353513 (https://phabricator.wikimedia.org/T151971) (owner: Ayounsi)
[09:34:19] going to deploy a hotfix for MobileFrontend due to a change in mediawiki core HTML output
[09:34:31] that causes collapsible sections to be broken on the mobile web
[09:35:09] and we need to purge the HTML cache :/
[09:36:07] (PS2) Ayounsi: Add logstash-syslog-tcp LVS service [puppet] - https://gerrit.wikimedia.org/r/353513 (https://phabricator.wikimedia.org/T151971)
[09:36:20] (CR) Ayounsi: Add logstash-syslog-tcp LVS service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/353513 (https://phabricator.wikimedia.org/T151971) (owner: Ayounsi)
[09:37:53] (CR) Ayounsi: [C: 2] Add logstash-syslog-tcp LVS service [puppet] - https://gerrit.wikimedia.org/r/353513 (https://phabricator.wikimedia.org/T151971) (owner: Ayounsi)
[09:39:58] PROBLEM - Host planet2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:42:20] ^ expected
[09:48:01] (CR) Alexandros Kosiaris: [C: 2] Renumber planet2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/353511 (owner: Alexandros Kosiaris)
[09:59:05] I am syncing the MobileFrontend hotfix
[09:59:23] !log hashar@tin Synchronized php-1.30.0-wmf.1/extensions/MobileFrontend: Correctly handle the mw-parser-output wrapper - T164733 (duration: 00m 43s)
[09:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:32] T164733: Section collapsing is broken in MobileFrontend - https://phabricator.wikimedia.org/T164733
[09:59:37] jdlrobson: deployed!
[10:11:58] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[10:13:53] RECOVERY - Host planet2001 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[10:28:17] (PS1) Hashar: contint: move hhvm-dev to a different class [puppet] - https://gerrit.wikimedia.org/r/353519
[10:38:58] <_joe_> !log moved hpssacli.tar.gz to /root on puppetmaster1001
[10:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:26] Operations, netops, Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3258144 (fgiunchedi) @ayounsi I'm guessing the cause was a network-intensive hadoop job, so yeah that might not be a one-time reoccurrence...
[10:43:49] (PS1) Hashar: contint: experimental component for nodepool instances [puppet] - https://gerrit.wikimedia.org/r/353520
[10:59:48] Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3258161 (fgiunchedi) I thought about this for a little while but don't have a good suggestion besides `component`
[11:00:26] (CR) Ema: [C: 1] varnish: Rename planet1001 director to planet [puppet] - https://gerrit.wikimedia.org/r/353509 (owner: Alexandros Kosiaris)
[11:03:46] Operations, ops-eqiad, User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3258166 (fgiunchedi) Ditto on ms-be1019 now ``` Cache Status: Permanently Disabled Cache Status Details: Cable Error Cache Ratio: 10% Read / 90% Write Drive Write Cache:...
[11:07:10] Operations, ops-eqiad, User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3258170 (fgiunchedi) a:Cmjohnson @Cmjohnson have you seen this error before? namely: ``` root@ms-be1021:~# hpssacli controller slot=3 show status Smart Array P840 in Slot 3 C...
[11:09:03] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[11:11:58] (PS1) Alexandros Kosiaris: Change the default LVS BGP behavior per service [debs/pybal] - https://gerrit.wikimedia.org/r/353525
[11:14:27] (CR) Alexandros Kosiaris: [C: 2] varnish: Rename planet1001 director to planet [puppet] - https://gerrit.wikimedia.org/r/353509 (owner: Alexandros Kosiaris)
[11:14:32] (PS2) Alexandros Kosiaris: varnish: Rename planet1001 director to planet [puppet] - https://gerrit.wikimedia.org/r/353509
[11:14:38] (CR) Alexandros Kosiaris: [V: 2 C: 2] varnish: Rename planet1001 director to planet [puppet] - https://gerrit.wikimedia.org/r/353509 (owner: Alexandros Kosiaris)
[11:15:52] (CR) Giuseppe Lavagetto: [C: 1] Change the default LVS BGP behavior per service [debs/pybal] - https://gerrit.wikimedia.org/r/353525 (owner: Alexandros Kosiaris)
[11:25:52] (CR) Mark Bergsma: [C: 1] Change the default LVS BGP behavior per service [debs/pybal] - https://gerrit.wikimedia.org/r/353525 (owner: Alexandros Kosiaris)
[11:27:24] (CR) Ema: "Should this type of change be merged into master or into 1.13? I don't have a clear understanding of that yet, but last time we discussed" [debs/pybal] - https://gerrit.wikimedia.org/r/353525 (owner: Alexandros Kosiaris)
[11:35:36] !log cleaning old elasticsearch and logstash logs on logstash cluster
[11:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:54] Operations, Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3258288 (Aklapper) (Has this happened lately? I'm not aware, so maybe this is lower priority now for us?)
[11:40:55] Operations, Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3258293 (Marostegui) From a DB server point of view we suffered a small issue with the slave (db1048) BBU again a few days ago (T160731#3246659), but that shouldn'...
[11:46:52] Operations, netops: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258312 (ayounsi)
[11:47:45] Operations, netops: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258314 (faidon) Thanks @Josve05a for relaying the report, that's very useful. The reverse path was going via PCCW to Singapore(!) and then to Jio in India: ``` Host...
[11:57:56] Operations, Patch-For-Review, User-Elukey, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3258338 (elukey) After deploying https://gerrit.wikimedia.org/r/353247 to labs I hav...
[12:01:41] Operations, DBA, Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3258345 (MoritzMuehlenhoff) p:Triage>Normal
[12:03:11] Operations: archiva artifact links point to 127.0.0.1 - https://phabricator.wikimedia.org/T164993#3258348 (MoritzMuehlenhoff) p:Triage>Normal
[12:03:43] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received
[12:04:22] Operations, netops: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258352 (Reedy) >>! In T165103#3258314, @faidon wrote: > I wonder how many of those cases we have and how to find them in our very big haystack. Any ideas? Presumably, that's what things like the...
[12:28:37] (PS2) Muehlenhoff: contint: experimental component for nodepool instances [puppet] - https://gerrit.wikimedia.org/r/353520 (owner: Hashar)
[12:32:20] Operations, netops: High latency for reaching Wikipedia from Jio - https://phabricator.wikimedia.org/T165103#3258433 (Josve05a) >>! In T165103#3258314, @faidon wrote: > @Josve05a, will you convey the fix back to them or should we? If you can, please do, otherwise, I'll just quote all comments made to t...
[12:35:15] (CR) Muehlenhoff: [C: 2] contint: experimental component for nodepool instances [puppet] - https://gerrit.wikimedia.org/r/353520 (owner: Hashar)
[12:36:07] akosiaris: can I puppet-merge your change along?
[12:37:06] (varnish: Rename planet1001 to planet)
[12:38:14] doing that now
[12:40:43] I, the author of this ticket, no longer have access to it. https://phabricator.wikimedia.org/T165103 ---
[12:41:20] Reedy: ^
[12:41:59] so, if someone can respond to the OTRS ticket, that would be swell
[12:42:08] I'm locked out of OTRS
[12:42:35] (or give me a blurb to forward*)
[12:42:49] Josve05a: phabricator shouldn't have restricted you from it as you're the author (I believe), so that's weird
[12:43:00] Just added you as a CC
[12:43:42] tytill locked out
[12:43:44] still*
[12:44:23] Josve05a: Try again
[12:44:34] Looks like the custom policy was just WMF-NDA
[12:44:39] Josve05a: you may have to shift+f5
[12:44:40] I re-added a rule to allow subscribers to see it
[12:44:49] (PS2) Muehlenhoff: contint: move hhvm-dev to a different class [puppet] - https://gerrit.wikimedia.org/r/353519 (owner: Hashar)
[12:45:00] except, that didn't save
[12:45:00] ffs
[12:45:19] Now it did
[12:45:39] ty
[12:45:54] Josve05a: You really want to copy a few of the replies to them
[12:46:05] It's been suggested they peer directly with us in AMS-IX too etc
[12:46:54] (CR) Muehlenhoff: [C: 2] contint: move hhvm-dev to a different class [puppet] - https://gerrit.wikimedia.org/r/353519 (owner: Hashar)
[12:47:33] (PS4) Giuseppe Lavagetto: role::deployment_server: generate dsh lists for zotero [puppet] - https://gerrit.wikimedia.org/r/353291
[12:48:46] Reedy: I've got no technical knowledge whatsoever (ironically)...I asked them if they want help to join phab and see the ticket directly, or if they wanted me to relay all comments via email, and they said email was fine...so...
[12:49:43] Josve05a: just copy back both replies by faidon and ayounsi to them
[12:49:48] will do
[12:49:55] well no, they're different
[12:50:06] we should just prepare one that incorporates both of our responses
[12:50:26] as I did something that fixes this, but also XioNoX made some good points
[12:50:34] including asking to peer with them :)
[12:52:48] Hello, I've attached two comments/responses (as PDFs) made on the ticket, which should help you guys. I'll forward your response to the team when you reply.
[12:53:13] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[12:53:15] That's simple...merging the comments to a message...that is hard
[12:53:33] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.155 second response time
[12:53:46] !log downgrading mw1161 (job runner) to HHVM 3.12, some known instabilities and fix for one HHVM 3.18 will likely be available next week, so going the conservative way over the weekend
[12:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:13] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.104 second response time
[12:54:33] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.179 second response time
[12:55:43] Josve05a: If you ask paravoid nicely, he'll make you a combined response to send them
[12:56:13] (PS1) Hashar: contint: pin HHVM packages to use experimental component [puppet] - https://gerrit.wikimedia.org/r/353533 (https://phabricator.wikimedia.org/T165074)
[12:57:06] paravoid *puppy eyes* Would you, perhaps, think that you could, maybe, help me write that response? :)
[12:57:17] jouncebot: refresh
[12:57:20] I refreshed my knowledge about deployments.
[12:57:21] jouncebot: next
[12:57:21] In 67 hour(s) and 2 minute(s): ores_classification clean up party (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170515T0800)
[12:57:33] (CR) jerkins-bot: [V: -1] contint: pin HHVM packages to use experimental component [puppet] - https://gerrit.wikimedia.org/r/353533 (https://phabricator.wikimedia.org/T165074) (owner: Hashar)
[12:58:52] (PS2) Hashar: contint: pin HHVM packages to use experimental component [puppet] - https://gerrit.wikimedia.org/r/353533 (https://phabricator.wikimedia.org/T165074)
[12:59:27] (CR) Giuseppe Lavagetto: [C: 1] "https://puppet-compiler.wmflabs.org/6403/tin.eqiad.wmnet/ DTRT" [puppet] - https://gerrit.wikimedia.org/r/353291 (owner: Giuseppe Lavagetto)
[12:59:40] (PS5) Giuseppe Lavagetto: role::deployment_server: generate dsh lists for zotero [puppet] - https://gerrit.wikimedia.org/r/353291
[13:06:19] !log repooled mw2098 (was down with hardware error)
[13:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:20] Operations, Patch-For-Review, User-Elukey, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3258483 (Joe) >>! In T125735#3258338, @elukey wrote: > After deploying https://gerri...
[13:09:18] (CR) Muehlenhoff: [C: 2] contint: pin HHVM packages to use experimental component [puppet] - https://gerrit.wikimedia.org/r/353533 (https://phabricator.wikimedia.org/T165074) (owner: Hashar)
[13:10:38] (PS6) Giuseppe Lavagetto: role::deployment_server: generate dsh lists for zotero [puppet] - https://gerrit.wikimedia.org/r/353291
[13:12:28] (CR) Giuseppe Lavagetto: [V: 2 C: 2] role::deployment_server: generate dsh lists for zotero [puppet] - https://gerrit.wikimedia.org/r/353291 (owner: Giuseppe Lavagetto)
[13:12:33] PROBLEM - swift-container-server on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:12:34] PROBLEM - dhclient process on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:12:34] PROBLEM - salt-minion processes on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:12:43] PROBLEM - swift-container-updater on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:12:53] PROBLEM - swift-container-replicator on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:03] PROBLEM - swift-account-server on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:03] PROBLEM - swift-account-auditor on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:03] PROBLEM - swift-account-replicator on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:23] PROBLEM - swift-account-reaper on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:23] PROBLEM - swift-object-replicator on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:23] PROBLEM - swift-object-updater on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:23] PROBLEM - swift-object-server on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:33] PROBLEM - swift-object-auditor on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:34] PROBLEM - swift-container-auditor on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:13:34] RECOVERY - salt-minion processes on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:13:34] RECOVERY - dhclient process on ms-be2012 is OK: PROCS OK: 0 processes with command name dhclient [13:13:34] RECOVERY - swift-container-updater on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [13:13:43] RECOVERY - swift-container-replicator on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [13:13:53] RECOVERY - swift-account-server on ms-be2012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:13:53] RECOVERY - swift-account-replicator on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:13:53] RECOVERY - swift-account-auditor on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [13:14:13] RECOVERY - swift-object-replicator on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [13:14:13] RECOVERY - swift-account-reaper on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [13:14:13] RECOVERY - swift-object-updater on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [13:14:13] RECOVERY - swift-object-server on ms-be2012 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [13:14:23] RECOVERY - swift-object-auditor on ms-be2012 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [13:14:23] RECOVERY - 
swift-container-auditor on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:14:23] RECOVERY - swift-container-server on ms-be2012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [13:14:42] <_joe_> wow [13:16:49] I guess something was fixed.... [13:18:13] RECOVERY - mediawiki-installation DSH group on mw2098 is OK: OK [13:20:20] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3258535 (10MoritzMuehlenhoff) Since we're close to the weekend, I've downgraded the canary job runner (where this happened) to 3.12. There's another HHVM bug (T162586) which has now been narrowed down t... [13:28:40] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review: Move logstash ingestion behind LVS - https://phabricator.wikimedia.org/T151971#3258546 (10ayounsi) a:03ayounsi I think we're done here. Thanks for the help! [13:28:57] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review: Move logstash ingestion behind LVS - https://phabricator.wikimedia.org/T151971#3258548 (10ayounsi) 05Open>03Resolved [13:32:50] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3258552 (10MoritzMuehlenhoff) Status update: The app servers mw1170-mw1184, mw1261-mw1265 are now running HHVM 3.18. It's working mostly fine there, we're seeing them crash every 2... 
[13:37:26] (03PS1) 10Hashar: apt:pin pref file must not have space [puppet] - 10https://gerrit.wikimedia.org/r/353540 [13:38:17] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: separate build script for alpine linux [puppet] - 10https://gerrit.wikimedia.org/r/353275 (https://phabricator.wikimedia.org/T165024) [13:39:35] (03PS1) 10Hashar: contint: fix apt::pin resource name [puppet] - 10https://gerrit.wikimedia.org/r/353542 [13:42:05] (03CR) 10Muehlenhoff: [C: 032] contint: fix apt::pin resource name [puppet] - 10https://gerrit.wikimedia.org/r/353542 (owner: 10Hashar) [13:43:25] (03CR) 10Hashar: "I came accross that gem when doing:" [puppet] - 10https://gerrit.wikimedia.org/r/353540 (owner: 10Hashar) [13:47:06] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::baseimages: separate build script for alpine linux [puppet] - 10https://gerrit.wikimedia.org/r/353275 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [13:47:16] (03PS3) 10Giuseppe Lavagetto: docker::baseimages: separate build script for alpine linux [puppet] - 10https://gerrit.wikimedia.org/r/353275 (https://phabricator.wikimedia.org/T165024) [13:47:22] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] docker::baseimages: separate build script for alpine linux [puppet] - 10https://gerrit.wikimedia.org/r/353275 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [13:47:41] !log rebooting mw2110-mw2117 for update to Linux 4.9 [13:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:00] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [14:04:13] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3258620 (10hashar) [14:04:20] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, and 2 others: CI tests failing with segfault - 
https://phabricator.wikimedia.org/T165074#3258616 (10hashar) 05Open>03Resolved @MoritzMuehlenhoff has pushed HHVM 3.12 to 'experimental' and I wrote a few rules to pin that version.... [14:06:00] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [14:10:08] !log rebooting mw2163-mw2179 for update to Linux 4.9 [14:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:30] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:30] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:30] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:30] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the 
unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:30] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retr [14:26:30] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:31] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:31] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:32] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:32] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:33] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{ [14:26:40] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:40] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is 
CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/revision/{revision} (Get rev by [14:26:40] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 503 (expecting: 200) [14:26:40] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revisio [14:26:40] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 503 (expe [14:26:40] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title [14:26:41] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) [14:26:41] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by [14:26:42] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/graph/png/{title} [14:26:42] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{ [14:26:42] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds 
with unexpected body: /extract = : /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 503 (expe [14:27:00] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{ [14:27:11] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title f [14:27:56] ??? [14:28:16] mhh api is in trouble I think [14:30:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:30:29] (03PS1) 10Jgreen: switch indium to frlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/353551 [14:31:30] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:31:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:31:46] godog: I can see some spikes in fetch failed for say cp1055 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&var-server=cp1055&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now [14:33:20] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:34:30] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:35:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:35:48] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/348930 (https://phabricator.wikimedia.org/T148955) (owner: 10Marostegui) [14:36:43] whoa [14:36:51] godog: mediawiki api? [14:37:51] urandom: yeah, looks like it has recovered, I'm still looking on what happened [14:38:22] (03CR) 10Jgreen: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/353551 (owner: 10Jgreen) [14:38:24] elukey: could be that too, not sure yet [14:38:42] (03CR) 10Jgreen: [C: 032] switch indium to frlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/353551 (owner: 10Jgreen) [14:39:13] urandom: o/ - those alarms were in WARNING since yesterday, I wanted to ask you guys why [14:39:47] godog: from the fatal monitor I don't see spikes in errors.. 
[14:40:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:40:26] same thing from logstash hhvm [14:40:31] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:41:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:41:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:41:51] urandom: [14:41:53] elukey@restbase2002:~$ check-restbase [14:41:53] /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract => [14:42:00] this started yesterday [14:43:00] elukey: yeah [14:43:13] elukey: i have no answer there [14:43:26] elukey: fwiw, it seems to be working [14:43:46] i'm hoping mobrovac knows something [14:44:23] elukey: ftr, i mean when i test it seems to be working as expected, obviously the check is not working [14:44:24] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3258685 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [14:47:59] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: fixes to the alpine build script [puppet] - 10https://gerrit.wikimedia.org/r/353555 [14:50:17] urandom: i'm here, looking into it [14:50:45] mobrovac: gm [14:51:00] urandom: hehe, that too [14:51:10] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3258690 (10RobH) [14:51:36] (03PS1) 10Muehlenhoff: Make HHVM depend on nutcracker service [puppet] - 10https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) [14:51:46] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3255501 (10RobH) Vancis confrmed via email addition to list, EvoSwitch 
confirmed receipt of email but passed it to their account team. Equinix's email reset for the portal also hit his... [14:52:34] 06Operations, 10Wikimedia-Logstash, 06Discovery-Search (Current work): logstash mapping mixing up field types - https://phabricator.wikimedia.org/T165137#3258693 (10EBernhardson) Is this still ocurring? I noticed it yesterday and figured out a reproduction in beta cluster, then applied a template update whic... [14:57:23] (03CR) 10Giuseppe Lavagetto: [C: 031] Make HHVM depend on nutcracker service [puppet] - 10https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) (owner: 10Muehlenhoff) [14:57:46] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: fixes to the alpine build script [puppet] - 10https://gerrit.wikimedia.org/r/353555 [14:59:33] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:04:31] (03CR) 10Giuseppe Lavagetto: [C: 032] docker::baseimages: fixes to the alpine build script [puppet] - 10https://gerrit.wikimedia.org/r/353555 (owner: 10Giuseppe Lavagetto) [15:15:43] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:16:16] 06Operations, 10ops-eqiad: mw1172 failed to reboot - https://phabricator.wikimedia.org/T165023#3258788 (10Cmjohnson) 05Open>03Resolved Reset the hardware and server booted [15:16:43] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set [15:20:45] (03CR) 10Anomie: [C: 031] "The idea is sound, and this seems like it would do it. I have no idea how to test it though." 
[puppet] - 10https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) (owner: 10Tim Starling) [15:25:59] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: decom fluorine - https://phabricator.wikimedia.org/T159996#3258797 (10Cmjohnson) [15:26:31] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3258798 (10Cmjohnson) [15:26:58] 06Operations, 10ops-eqiad: decommission beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T147934#3258805 (10Cmjohnson) [15:27:39] 06Operations, 10ops-eqiad, 10Analytics: SATA errors for stat1004 in the dmesg - https://phabricator.wikimedia.org/T162770#3258807 (10Cmjohnson) [15:30:26] (03PS1) 10Gehel: logstash - cleanup of indices is done from multiple nodes for redundancy [puppet] - 10https://gerrit.wikimedia.org/r/353559 [15:35:35] (03CR) 10DCausse: [C: 031] logstash - cleanup of indices is done from multiple nodes for redundancy [puppet] - 10https://gerrit.wikimedia.org/r/353559 (owner: 10Gehel) [15:36:51] (03CR) 10Gehel: [C: 032] logstash - cleanup of indices is done from multiple nodes for redundancy [puppet] - 10https://gerrit.wikimedia.org/r/353559 (owner: 10Gehel) [15:37:54] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [15:39:53] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [15:43:12] (03PS1) 10Gehel: logstash - apifeature indices need to be cleaned up [puppet] - 10https://gerrit.wikimedia.org/r/353560 [15:45:36] 06Operations, 06Discovery, 06Services (watching), 15User-mobrovac: Set up Logstash behind LVS - https://phabricator.wikimedia.org/T159004#3258840 (10Gehel) 05Open>03Resolved a:03Gehel This has been fixed by T151971. 
[15:46:49] gehel XioNoX thnx for ^ \o/ [15:47:25] mobrovac: now time for some testing and reconfiguring log producers... [15:48:44] 06Operations, 05Goal, 07kubernetes: Design and implement a Kubernetes-based staging environment. (stretch) - https://phabricator.wikimedia.org/T162045#3258851 (10RobH) [15:48:45] 06Operations, 10hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3258849 (10RobH) 05stalled>03Resolved systems on this have been ordered, so this request is resolved (installation is handled via sub-tasks) [15:50:10] gehel: yup, once you are happy with it, let me know so I can switch services to use it :P [15:51:12] mobrovac: it is our offsite next week, so don't expect a ping from me before May 22... but I'll be back! [15:51:43] gehel: no worries gehel, i'm out half of next week and the week after that, so not in a hurry :) [15:58:57] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3258872 (10fgiunchedi) @jcrespo I have a package of mysqld-exporter 0.10.0 built on copper, if you'd like to give it a try [16:01:35] 06Operations, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Performance-Team, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#3258878 (10AndyRussG) Hi! Thanks much @Krinkle for explaining this... [16:07:33] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2058677 [16:08:00] 06Operations, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Performance-Team, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#3258920 (10AndyRussG) P.S. Since I'm pretty ignorant of service wo... 
[16:13:54] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10RobH) [16:14:00] 06Operations, 10ops-eqiad: rack/setup/install ores1001-1009 - https://phabricator.wikimedia.org/T165171#3258960 (10RobH) [16:17:33] PROBLEM - mediawiki-installation DSH group on mw1172 is CRITICAL: Host mw1172 is not in mediawiki-installation dsh group [16:21:13] 06Operations, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Performance-Team, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#3258983 (10Krinkle) @AndyRussG Service workers can help us in two... [16:23:34] !log repooled mw1172 after scap pull (was down with hardware error) [16:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:51] "MFCustomLogos config option is deprecated. Please use MinervaCustomLogos instead." [16:26:43] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3258986 (10RobH) I've just emailed the person I ordered and followed up on shipments with for these devices, and CCed @Cmjohnson on the thread (since he will be handling the actual swap.) [16:26:44] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:32:13] (03PS1) 10BBlack: VCL: be careful about grace/keep on 0-TTL objects... 
[puppet] - 10https://gerrit.wikimedia.org/r/353567 (https://phabricator.wikimedia.org/T164768) [16:37:13] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated [16:37:33] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 69337 [16:38:50] (03CR) 10Ema: [C: 031] VCL: be careful about grace/keep on 0-TTL objects... [puppet] - 10https://gerrit.wikimedia.org/r/353567 (https://phabricator.wikimedia.org/T164768) (owner: 10BBlack) [16:42:04] (03PS2) 10BBlack: VCL: be careful about grace/keep on 0-TTL objects... [puppet] - 10https://gerrit.wikimedia.org/r/353567 (https://phabricator.wikimedia.org/T164768) [16:42:47] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Ferm rules for labstore NFS hosts - https://phabricator.wikimedia.org/T165136#3259014 (10MoritzMuehlenhoff) Here's a breakdown of the current NFS port configuration for labstore and what needs to change: rpc.mountd: dumps: 32767 labstore1003... 
[16:46:03] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:34] (03PS1) 10Ema: varnish: reduce keep setting on frontends [puppet] - 10https://gerrit.wikimedia.org/r/353570 (https://phabricator.wikimedia.org/T165063) [16:48:27] (03PS1) 10Bearloga: Add Shiny Server module and Discovery Dashboards role [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) [16:50:44] (03CR) 10BBlack: [C: 031] varnish: reduce keep setting on frontends [puppet] - 10https://gerrit.wikimedia.org/r/353570 (https://phabricator.wikimedia.org/T165063) (owner: 10Ema) [16:51:25] (03PS3) 10Ema: VCL: be careful about grace/keep on 0-TTL objects... [puppet] - 10https://gerrit.wikimedia.org/r/353567 (https://phabricator.wikimedia.org/T165063) (owner: 10BBlack) [16:53:02] 06Operations, 10ops-eqiad, 10Dumps-Generation: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3259047 (10RobH) [16:53:35] !log powercycling mw1294 (machine unacessible/locked up) [16:53:40] 06Operations, 10ops-eqiad, 10Dumps-Generation: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3259047 (10RobH) @Cmjohnson: I know you have a LOT of incoming hardware right now, so once the on-site specific steps are done, you can push this to me for the puppet updates/os inst... 
[16:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:53] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[16:56:23] RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[16:58:43] PROBLEM - nutcracker port on mw1294 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[16:59:43] RECOVERY - nutcracker port on mw1294 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[17:09:46] thcipriani: Getting ready to roll out https://gerrit.wikimedia.org/r/#/c/353582/1
[17:10:06] Krinkle: I saw, seems fine
[17:11:31] Okay, thanks for checking.
[17:15:32] (PS1) BBlack: VCL: Do not assume obj.grace > grace_healthy [puppet] - https://gerrit.wikimedia.org/r/353585 (https://phabricator.wikimedia.org/T165063)
[17:17:33] RECOVERY - mediawiki-installation DSH group on mw1172 is OK: OK
[17:22:12] Krinkle: oh, I just remembered, I reverted mwdebug1002 to 1.29 to test something, that may be confusing if you use that server to check this :)
[17:22:32] thcipriani: Was gonna use 1001, but thanks.
[17:22:34] Doing it now.
[17:23:28] okie doke, I'll unrevert 1002 shortly/when you're done
[17:30:45] yeah, just taking a while to verify.
[17:30:50] Not getting the result I want in logstash
[17:32:02] OK. Go tit
[17:32:05] got it
[17:32:44] syncing now
[17:33:17] !log krinkle@tin Synchronized php-1.30.0-wmf.1/includes/resourceloader/ResourceLoaderClientHtml.php: (no justification provided) (duration: 00m 40s)
[17:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:55] (CR) Ema: [V: 2 C: 2] varnish: reduce keep setting on frontends [puppet] - https://gerrit.wikimedia.org/r/353570 (https://phabricator.wikimedia.org/T165063) (owner: Ema)
[17:36:15] (PS4) BBlack: VCL: be careful about grace/keep on 0-TTL objects... [puppet] - https://gerrit.wikimedia.org/r/353567 (https://phabricator.wikimedia.org/T165063)
[17:38:14] !log cp4010: upgrade varnish back to 4.1.6-1wm1, transient storage issues are unrelated
[17:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:26] (PS2) BBlack: VCL: Do not assume obj.grace > grace_healthy [puppet] - https://gerrit.wikimedia.org/r/353585 (https://phabricator.wikimedia.org/T165063)
[17:38:36] (CR) BBlack: [V: 2 C: 2] VCL: Do not assume obj.grace > grace_healthy [puppet] - https://gerrit.wikimedia.org/r/353585 (https://phabricator.wikimedia.org/T165063) (owner: BBlack)
[17:38:56] (PS5) BBlack: VCL: be careful about grace/keep on 0-TTL objects... [puppet] - https://gerrit.wikimedia.org/r/353567 (https://phabricator.wikimedia.org/T165063)
[17:39:05] (CR) BBlack: [V: 2 C: 2] VCL: be careful about grace/keep on 0-TTL objects... [puppet] - https://gerrit.wikimedia.org/r/353567 (https://phabricator.wikimedia.org/T165063) (owner: BBlack)
[17:49:15] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/includes/parser/Parser.php: [[gerrit:353584|Revert "Wrap parser output in "]] 1/4 (duration: 00m 39s)
[17:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:10] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/includes/cache/MessageCache.php: [[gerrit:353584|Revert "Wrap parser output in "]] 2/4 (duration: 00m 39s)
[17:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:59] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/includes/api/ApiParse.php: [[gerrit:353584|Revert "Wrap parser output in "]] 3/4 (duration: 00m 42s)
[17:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:27] !log thcipriani@tin Started scap: [[gerrit:353584|Revert "Wrap parser output in "]] 4/4
[17:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:28] (PS2) Dzahn: syslog::centralserver: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353124
[18:01:06] (PS3) Dzahn: syslog::centralserver: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353124
[18:02:58] ehm. anyone reported yet that lot of the TextExtracts seem to be MIA ?
[18:03:50] page previews for lots of links are failing, and if I look at the XHR, they are all missing their extract in that case
[18:04:53] ah, i see it's reported already
[18:05:17] thedj: yeah there's a ticket https://phabricator.wikimedia.org/T165161
[18:05:25] the above sync's are meant to address this issue
[18:05:38] there's still more to do afai understand it though
[18:10:40] !log thcipriani@tin Finished scap: [[gerrit:353584|Revert "Wrap parser output in "]] 4/4 (duration: 19m 13s)
[18:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:24] ^ mobrovac anomie sync is complete for T165161 issue
[18:11:24] T165161: Text extracts empty for some articles - https://phabricator.wikimedia.org/T165161
[18:12:44] i was already afraid that that change would wreck more havoc than anticipated :(
[18:29:36] (CR) Dzahn: "no-op confirmed on wezen and lithium" [puppet] - https://gerrit.wikimedia.org/r/353124 (owner: Dzahn)
[18:29:52] PROBLEM - configured eth on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:30:42] RECOVERY - configured eth on ms-be1021 is OK: OK - interfaces up
[18:31:13] (CR) Dzahn: [C: 2] piwik: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353354 (owner: Dzahn)
[18:31:19] (PS2) Dzahn: piwik: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353354
[18:34:18] (PS2) Bearloga: Add Shiny Server module and Discovery Dashboards role [puppet] - https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354)
[18:35:21] (CR) Dzahn: "no-op confirmed on bohrium" [puppet] - https://gerrit.wikimedia.org/r/353354 (owner: Dzahn)
[18:35:39] (CR) Dzahn: [C: 2] dumps::zim: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353358 (owner: Dzahn)
[18:35:45] (PS2) Dzahn: dumps::zim: move 'include standard' to role [puppet] - https://gerrit.wikimedia.org/r/353358
[18:43:04] (PS1) Legoktm: Add RejectParserCacheValue handler for mw-parser-output invalidation [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161)
[18:43:17] thcipriani, MaxSem ^
[18:44:32] (CR) MaxSem: "Can you add a commit referring to the bug?" [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161) (owner: Legoktm)
[18:44:46] (CR) MaxSem: "s/commit/comment/" [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161) (owner: Legoktm)
[18:45:09] (PS2) Legoktm: Add RejectParserCacheValue handler for mw-parser-output invalidation [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161)
[18:45:16] MaxSem: done
[18:45:49] (CR) Dzahn: "confirmed no-op on francium" [puppet] - https://gerrit.wikimedia.org/r/353358 (owner: Dzahn)
[18:46:02] (CR) MaxSem: [C: 1] Add RejectParserCacheValue handler for mw-parser-output invalidation [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161) (owner: Legoktm)
[18:46:58] (CR) Dzahn: [C: 2] webperf: move 'standard' and 'base::firewall' to role [puppet] - https://gerrit.wikimedia.org/r/353359 (owner: Dzahn)
[18:46:59] legoktm: ok, so this one to clear parsercache, textextracts change for extract cache, correct?
[18:47:12] (PS2) Dzahn: webperf: move 'standard' and 'base::firewall' to role [puppet] - https://gerrit.wikimedia.org/r/353359
[18:47:53] (CR) Chad: Setup apache vhost on scap proxies as well (1 comment) [puppet] - https://gerrit.wikimedia.org/r/344221 (owner: Chad)
[18:47:53] !log starting spaced-out ~4h run of "run-no-puppet varnish-frontend-restart" on cache_upload+cache_text to re-set transient storage levels (in screen on neodymium)
[18:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:51] thcipriani: yes...
[18:49:34] (CR) Thcipriani: [C: 2] Add RejectParserCacheValue handler for mw-parser-output invalidation [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161) (owner: Legoktm)
[18:49:35] parser cache one needs to go before TE
[18:50:39] (Merged) jenkins-bot: Add RejectParserCacheValue handler for mw-parser-output invalidation [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161) (owner: Legoktm)
[18:51:00] (PS3) Chad: DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979
[18:52:35] legoktm: live on mwdebug1002, check please
[18:53:08] thcipriani: works!
[18:54:09] legoktm: ok, going live
[18:55:58] we should see a increase in "rejected" on https://grafana.wikimedia.org/dashboard/db/parser-cache?refresh=5m&orgId=1
[18:56:03] (CR) Dzahn: "confirmed no-op on hafnium" [puppet] - https://gerrit.wikimedia.org/r/353359 (owner: Dzahn)
[18:56:07] legoktm: well. the scap canary check works
[18:56:21] legoktm: Warning: strpos() expects parameter 1 to be string, object given
[18:56:35] uh
[18:56:41] oh ffs
[18:56:47] don't sync it yet
[18:56:58] or cancel
[18:57:07] yup, it's not going
[18:57:56] (PS1) Legoktm: Fix RejectParserCacheValue hook [mediawiki-config] - https://gerrit.wikimedia.org/r/353597
[18:58:10] I've definitely made this same exact mistake before
[18:58:16] (CR) MaxSem: [C: 2] Fix RejectParserCacheValue hook [mediawiki-config] - https://gerrit.wikimedia.org/r/353597 (owner: Legoktm)
[18:58:42] make ParserOutput stringify to getText()?
[18:59:19] (Merged) jenkins-bot: Fix RejectParserCacheValue hook [mediawiki-config] - https://gerrit.wikimedia.org/r/353597 (owner: Legoktm)
[19:00:00] legoktm: MaxSem it's live on mwdebug1002 if you want to check
[19:00:24] tested
[19:00:32] ok, going live
[19:00:53] (CR) jenkins-bot: Add RejectParserCacheValue handler for mw-parser-output invalidation [mediawiki-config] - https://gerrit.wikimedia.org/r/353596 (https://phabricator.wikimedia.org/T165161) (owner: Legoktm)
[19:00:55] (CR) jenkins-bot: Fix RejectParserCacheValue hook [mediawiki-config] - https://gerrit.wikimedia.org/r/353597 (owner: Legoktm)
[19:01:34] hhvm.log is still being spammed with that message?
[19:02:37] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:353597|Add RejectParserCacheValue handler for mw-parser-output]] T165161 (duration: 00m 40s)
[19:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:45] T165161: Text extracts empty for some articles - https://phabricator.wikimedia.org/T165161
[19:02:54] legoktm: yeah, it was on the canaries still, should be clear now
[19:20:51] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/extensions/TextExtracts/includes/ApiQueryExtracts.php: [[gerrit:353593|API: Change memcache key to clear cache]] T165161 (duration: 00m 39s)
[19:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:00] T165161: Text extracts empty for some articles - https://phabricator.wikimedia.org/T165161
[19:29:54] (PS1) Dzahn: bastionhost: convert to role/profile structure [puppet] - https://gerrit.wikimedia.org/r/353599
[19:45:51] (PS1) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[19:46:52] (CR) jerkins-bot: [V: -1] phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600 (owner: Dzahn)
[19:50:51] (CR) Dzahn: [C: -1] "where does the diff come from http://puppet-compiler.wmflabs.org/6406/iron.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[19:54:16] (PS4) Chad: DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979
[19:54:18] (PS2) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[19:59:41] (PS3) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[19:59:43] Operations, DBA, Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3259609 (Dzahn) a:Dzahn
[20:00:58] (PS1) Ladsgroup: Write in term_full_entity_id in testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197)
[20:02:15] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/6408/iridium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/353600 (owner: Dzahn)
[20:08:27] (PS4) Dzahn: phabricator: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/353600
[20:11:54] (CR) Paladox: "This will need more testing then the other changes that converted classes into profile due to phabricator being critical. I can test this " [puppet] - https://gerrit.wikimedia.org/r/353600 (owner: Dzahn)
[20:32:34] (CR) Legoktm: [C: 1] "It's already using extension.json in extension-list." [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 (owner: Chad)
[20:40:28] (CR) Chad: "Yeah I already swapped the extension-list entry from the wikitech-specific one." [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 (owner: Chad)
[20:40:42] (CR) Chad: [C: 2] DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 (owner: Chad)
[20:42:04] (Merged) jenkins-bot: DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 (owner: Chad)
[20:42:16] (CR) jenkins-bot: DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 (owner: Chad)
[20:49:31] !log demon@tin Synchronized wmf-config/: Swapping DynamicSidebar to normal extension registration (duration: 00m 19s)
[20:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:29] legoktm: Thanks for the CR by the way
[20:50:40] np
[20:50:58] One less wikitech-specific thingie in config :)
[20:51:16] Plus now makes it possible for other wikis to request the extension, should they want it.
[20:52:38] wtf? Why is it spamming about undefined variable still?
[20:54:23] If it's undefined, how is it loading & working on wikitech?
[20:54:46] Hmm
[20:55:40] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Touch (duration: 00m 39s)
[20:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:12] Yep, that was it
[20:56:14] Cached
[20:56:17] Fun times!
[20:56:53] I thought scap 3 was configured to auto touch files
[21:05:30] No, it doesn't work like that.
[21:05:40] Also: MW doesn't use most of scap3 codebase yet, even if it did
[21:15:04] (CR) Krinkle: [C: 1] Move contribution tracking config to CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: Chad)
[21:18:42] Krinkle: I want to merge, but nobody's had time to help me babysit it ^ :(
[21:19:02] this also seems like a serious regression: https://phabricator.wikimedia.org/T165132
[21:19:28] (CR) Krinkle: Jenkins: install jdk, not just jre (1 comment) [puppet] - https://gerrit.wikimedia.org/r/348961 (owner: Chad)
[21:19:34] (CR) Krinkle: [C: -1] Jenkins: install jdk, not just jre [puppet] - https://gerrit.wikimedia.org/r/348961 (owner: Chad)
[21:20:00] RainbowSprinkles: Yeah, need's someone from fr to approve first
[21:21:15] (PS3) Chad: Jenkins: install jdk, not just jre [puppet] - https://gerrit.wikimedia.org/r/348961
[21:22:03] (CR) Paladox: [C: 1] Jenkins: install jdk, not just jre [puppet] - https://gerrit.wikimedia.org/r/348961 (owner: Chad)
[21:27:12] (CR) Chad: [C: -1] "I don't think we need this anymore. If we do, it should be well researched with config based in reality--rather than just copied from who " [puppet] - https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: Paladox)
[21:27:24] (Abandoned) Paladox: Gerrit: Enable g1 gc as we now use java 8 [puppet] - https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: Paladox)
[21:28:42] Operations, Gerrit, Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3259797 (demon) Open>Resolved a:demon This hasn't been a problem since I readjusted internal caches (less churn) and we lowered the heap (don't need it so high)...
[21:50:13] (Draft1) Paladox: Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008)
[21:50:17] (PS2) Paladox: Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008)
[21:50:31] (CR) Paladox: [C: -1] "Requires us to upgrade to gerrit 2.14 first." [puppet] - https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) (owner: Paladox)
[21:58:58] (CR) Aude: [C: -1] "testwikidata didn't get the new column yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/353601 (https://phabricator.wikimedia.org/T165197) (owner: Ladsgroup)
[22:20:46] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected body: /tfa/extract =