[00:00:48] ostriches: ping, I'm done
[00:01:35] Thanks
[00:04:12] !log Started `mwscript updateCollation.php itwiki --previous-collation=uppercase` on Terbium (T136647)
[00:04:13] T136647: Set UCA-IT as it.wiki's collation - https://phabricator.wikimedia.org/T136647
[00:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:34] (CR) BryanDavis: [C: 1] "Looks sane. I wouldn't worry about bikeshed issues like the choice of name for the base class to interface with the job execution service." [software/tools-webservice] - https://gerrit.wikimedia.org/r/292028 (owner: Yuvipanda)
[00:21:55] Danny_B: i can add you to that phabricator email. but which mail address do you want to use? and its gonna be public
[00:27:42] mutante: i don't mind if it is public, but i would prefer it should not be crawlable (aka, can they be obfuscated by ie splitting them to "username" .. "@" .. "domain" (idk what is string concat operator in pp)?)
[00:27:43] (CR) Dereckson: "Not deployed during 2016-05-30 evening SWAT, as Eranroz wasn't there." [mediawiki-config] - https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: Eranroz)
[00:34:36] uhm yea, certainly possible but kind of a new feature request
[00:36:10] current addresses are not obfuscated, but you know what, i could just add a list address and then people can subscribe and unsubscribe as they want and not worry about this part
[00:42:09] mutante: \o/ i had a task prepared to submit to suggest some more flexible way than hardcoding in source. i was thinking about other types of subscription, but maillist is the simplest since it doesn't require any inventing, just using of what we already use... want me to create task for it?
[00:43:06] Danny_B: ok :) yes, please
[00:43:49] mutante: assigning directly to you, ok?
[00:44:53] Danny_B: subscribe but not assign would be perfect
[00:45:24] wikimedia-mailing-lists tag
[00:48:21] mutante: t136660
[00:48:31] stashbot, you slacker!
[01:02:11] T136660
[01:02:11] T136660: Create phabricator-reports maillist - https://phabricator.wikimedia.org/T136660
[01:03:54] pywikibot-bugs Monitor updates to Bugzilla or Phabricator bug tracker
[01:04:02] Pywikipedia-bugs Monitor updates to Bugzilla or Phabricator bug tracker
[01:04:11] 2 mailing lists with identical description?
[01:05:12] plus pywikibot, pywikibot-announce, Pywikipedia-announce, Pywikibot-commits and Pywikipedia-l
[01:05:58] asks OIT to alias them all into one google account and hides, hehehee
[01:06:05] j/k
[01:06:32] mutante: lists were renamed s/pywikipedia/pywikibot/ but old names kept so archive links don't break?
[01:08:18] legoktm: yes, you are right. Pywikipedia-bugs listinfo page is a redirect
[01:08:34] so thats normal for renaming, ok yes
[01:09:08] the only difference could be that its not hidden from the index of all lists
[01:19:09] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1139.21 seconds
[01:22:17] (PS1) Dzahn: phabricator: send maintenance mail to list not individuals [puppet] - https://gerrit.wikimedia.org/r/292065 (https://phabricator.wikimedia.org/T136660)
[01:25:44] (CR) Dzahn: [C: 2] phabricator: send maintenance mail to list not individuals [puppet] - https://gerrit.wikimedia.org/r/292065 (https://phabricator.wikimedia.org/T136660) (owner: Dzahn)
[01:28:25] mutante: possible to trigger the mail creation script to test the delivery?
[01:40:14] Danny_B: yes, just did and it delivered
[01:45:53] (PS2) Dzahn: Weekly Phabricator email: List archived projects with open tasks [puppet] - https://gerrit.wikimedia.org/r/291781 (https://phabricator.wikimedia.org/T133649) (owner: Aklapper)
[01:46:48] (PS1) Alex Monk: Follow-up abd14947: Use HTTPS URL to citoid instead of protocol-relative [mediawiki-config] - https://gerrit.wikimedia.org/r/292068 (https://phabricator.wikimedia.org/T136423)
[01:47:40] mutante: ah, you probably did it before i managed to subscribe (confirm)
[01:47:53] Danny_B: no, i already subscribed you before
[01:48:44] aaa, it just popped in my mbox
[01:49:10] i used the form "mass subscription"
[01:49:18] Operations, Traffic, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2001011 (AlexMonk-WMF) I just got this again after making MW fatal in beta. Varnish started sending 503s instead, and varnishlog showed "...
[01:49:19] ok
[01:49:35] mutante: i've commented your diff in diffusion not in gerrit
[01:50:34] i would like to stay with gerrit until its officially not supported
[01:52:09] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.31 seconds
[02:07:48] Operations, Traffic, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343562 (BBlack) When you make MW fatal in beta, does hhvm emit some kind of valid error output, or is it truncating its output due to th...
[02:30:54] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 09m 30s)
[02:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:06] Operations, Traffic, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343586 (AlexMonk-WMF) My change seems to make MW fatal when Varnish contacts it but not when I try to curl. So it's difficult to tell.
[02:33:35] (CR) Dzahn: [C: 2] "i confirmed the "phstats" mysql user has the access for this, the query works and takes about 1.5 seconds" [puppet] - https://gerrit.wikimedia.org/r/291781 (https://phabricator.wikimedia.org/T133649) (owner: Aklapper)
[02:40:21] Operations, Traffic, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343599 (BBlack) Can you give me a way to reproduce it?
[02:50:52] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:04:32] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 15m 42s)
[03:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:55] Operations, Traffic, Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2343671 (AlexMonk-WMF) I just started typing instructions, went to test them (in my separate chrome session - which exists for this extra...
[03:11:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 1 03:11:11 UTC 2016 (duration 6m 39s)
[03:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:17:50] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[03:32:59] (PS1) Papaul: DNS: Add mgmt dns for the frist 24 new mw app servers mw22[1-3[0-9] Bug:T135466 [dns] - https://gerrit.wikimedia.org/r/292072 (https://phabricator.wikimedia.org/T135466)
[03:36:48] Operations, ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2343709 (Papaul)
[03:51:25] Operations, ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#2343721 (Papaul)
[03:52:01] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures
[04:17:00] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:27:51] !log demon@tin Synchronized php-1.28.0-wmf.3/includes/: reapplied new version of I03739e94 (duration: 01m 34s)
[04:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:31:03] !log demon@tin Synchronized php-1.28.0-wmf.4/includes/: reapplied new version of I03739e94 (duration: 01m 21s)
[04:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:50:35] Operations, Labs, netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2343755 (brion)
[05:04:58] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25)
[05:08:47] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[05:26:44] !log restbase deploy start of 5c99693
[05:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:41:42] !log restbase deploy end of 5c99693
[05:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:49:34] (PS3) Urbanecm: Enable VE in NS_PROJECT in cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/291957 (https://phabricator.wikimedia.org/T136628)
[06:19:11] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]
[06:20:10] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]
[06:25:50] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:26:50] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[06:29:50] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail
[06:30:10] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: puppet fail
[06:31:29] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:29] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:39] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:19] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:40] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:20] (CR) Nikerabbit: [C: -1] Beta: Enable Compact Language Links for new users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/291908 (https://phabricator.wikimedia.org/T136161) (owner: KartikMistry)
[06:56:00] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:56:10] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:56:30] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:56:34] (CR) Alexandros Kosiaris: [C: -1] "Minor inline comment, otherwise LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406) (owner: Ladsgroup)
[06:56:49] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:56:50] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:58:00] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:01] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:00:17] (CR) Alexandros Kosiaris: [C: 1] Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[07:00:29] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:47] (CR) Alexandros Kosiaris: [C: 1] MobileApps: use the provided request templates for API calls [puppet] - https://gerrit.wikimedia.org/r/291972 (owner: Mobrovac)
[07:02:02] (CR) Alexandros Kosiaris: [C: 2] Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[07:02:07] (PS5) Alexandros Kosiaris: Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[07:02:11] (CR) Alexandros Kosiaris: [V: 2] Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[07:05:45] !log change-prop restarting to apply https://gerrit.wikimedia.org/r/291201
[07:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:06:38] (Abandoned) Alexandros Kosiaris: Add ores-admin to scb cluster [puppet] - https://gerrit.wikimedia.org/r/291944 (owner: Alexandros Kosiaris)
[07:22:03] (PS1) Muehlenhoff: Stop using package=>latest [puppet] - https://gerrit.wikimedia.org/r/292091
[07:24:43] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:26:19] (PS1) Muehlenhoff: Stop using package=>latest [puppet] - https://gerrit.wikimedia.org/r/292093
[07:32:56] !log restarted hhvm on mw1180
[07:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:33:27] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 67935 bytes in 0.205 second response time
[07:34:17] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.081 second response time
[07:36:13] YuviPanda: which one should I +1?
[07:43:19] !log rolling reboot of scb in eqiad for update to Linux 4.4
[07:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:46:46] !log stopping kafka on kafka1020.eqiad and rebooting the host for Linux 4.4 upgrades
[07:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:48:45] I am also going to silence all the analytics kafka brokers for 10 minutes to avoid all the alerts for under-replicated partitions
[07:49:08] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2114 MB (3% inode=96%)
[07:56:43] !log event logging restarted on eventlog1001.eqiad.wmnet
[07:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:57:49] Operations, DBA, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2344085 (jcrespo) I'm sorry, what? When did I say it was going to take 12 hours? My last estimation was: > Revision table is ongoing now, but it has 700 M rows and it takes almost half a day to import and f...
[08:00:06] PROBLEM - dhclient process on kafka1020 is CRITICAL: Timeout while attempting connection
[08:00:06] PROBLEM - MD RAID on kafka1020 is CRITICAL: Timeout while attempting connection
[08:00:16] PROBLEM - Disk space on kafka1020 is CRITICAL: Timeout while attempting connection
[08:00:16] PROBLEM - DPKG on kafka1020 is CRITICAL: Timeout while attempting connection
[08:00:26] PROBLEM - SSH on kafka1020 is CRITICAL: Connection timed out
[08:00:28] PROBLEM - jmxtrans on kafka1020 is CRITICAL: Timeout while attempting connection
[08:00:47] PROBLEM - puppet last run on kafka1020 is CRITICAL: Timeout while attempting connection
[08:01:07] PROBLEM - MegaRAID on kafka1020 is CRITICAL: Timeout while attempting connection
[08:01:15] this is me sorry, first reboot didn't go as planned
[08:01:16] PROBLEM - salt-minion processes on kafka1020 is CRITICAL: Timeout while attempting connection
[08:01:23] I breached the 10 minutes downtime
[08:01:32] the host is booting atm, so those will clear in a bit
[08:01:32] PROBLEM - Kafka Broker Server on kafka1020 is CRITICAL: Timeout while attempting connection
[08:01:58] just finished boot, good timing
[08:02:06] RECOVERY - dhclient process on kafka1020 is OK: PROCS OK: 0 processes with command name dhclient
[08:02:06] RECOVERY - MD RAID on kafka1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[08:02:16] RECOVERY - Disk space on kafka1020 is OK: DISK OK
[08:02:16] RECOVERY - DPKG on kafka1020 is OK: All packages OK
[08:02:17] RECOVERY - SSH on kafka1020 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[08:02:27] RECOVERY - jmxtrans on kafka1020 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[08:02:47] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[08:03:06] RECOVERY - MegaRAID on kafka1020 is OK: OK: no disks configured for RAID
[08:03:07] RECOVERY - salt-minion processes on kafka1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:03:22] RECOVERY - Kafka Broker Server on kafka1020 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties
[08:04:02] I always like looking in to see that the problem isn't a real problem. (when it's during work hours anyways)
[08:04:47] I usually put the downtime to longer and then delete it after, not sure it matters that much
[08:05:29] apergos: you are absolutely right, this time I wanted to try the 10 minutes downtime and failed
[08:06:20] (even if it is really annoying to remove the downtime :P)
[08:08:42] I don't love the setting up/removal of alerts in icinga. I wonder how shinken2 or anything else is
[08:10:37] (PS1) Alexandros Kosiaris: ores: Set the ores::web port to default on 8081 [puppet] - https://gerrit.wikimedia.org/r/292099
[08:12:14] (CR) Alexandros Kosiaris: [C: 2] ores: Set the ores::web port to default on 8081 [puppet] - https://gerrit.wikimedia.org/r/292099 (owner: Alexandros Kosiaris)
[08:14:01] (CR) Alexandros Kosiaris: [C: 2] Add role::ores::web and roles::ores::worker [puppet] - https://gerrit.wikimedia.org/r/278989 (https://phabricator.wikimedia.org/T124201) (owner: Alexandros Kosiaris)
[08:14:07] (PS2) Alexandros Kosiaris: Add role::ores::web and roles::ores::worker [puppet] - https://gerrit.wikimedia.org/r/278989 (https://phabricator.wikimedia.org/T124201)
[08:14:13] (CR) Alexandros Kosiaris: [V: 2] Add role::ores::web and roles::ores::worker [puppet] - https://gerrit.wikimedia.org/r/278989 (https://phabricator.wikimedia.org/T124201) (owner: Alexandros Kosiaris)
[08:14:55] !log reboot labnodepool1001 for update to Linux 4.4
[08:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:40] !log deleting mysql logrotate scripts to avoid root spam
[08:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:19:11] (CR) Mobrovac: [C: 1] "Ok, let's go" [puppet] - https://gerrit.wikimedia.org/r/291972 (owner: Mobrovac)
[08:19:16] (CR) Alexandros Kosiaris: [C: 2] MobileApps: use the provided request templates for API calls [puppet] - https://gerrit.wikimedia.org/r/291972 (owner: Mobrovac)
[08:19:22] (PS2) Alexandros Kosiaris: MobileApps: use the provided request templates for API calls [puppet] - https://gerrit.wikimedia.org/r/291972 (owner: Mobrovac)
[08:19:26] (CR) Alexandros Kosiaris: [V: 2] MobileApps: use the provided request templates for API calls [puppet] - https://gerrit.wikimedia.org/r/291972 (owner: Mobrovac)
[08:22:04] !log Nodepool came back up just fine after labnodepool1001 reboot and is fully operational.
[08:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:24:15] (PS2) Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - https://gerrit.wikimedia.org/r/291949
[08:24:26] !log depooled reboot of cp3049 (T131928)
[08:24:27] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[08:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:25:18] !log mobileapps deploying 8d6d648
[08:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:27:37] PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100%
[08:29:17] RECOVERY - Host cp3049 is UP: PING OK - Packet loss = 0%, RTA = 82.86 ms
[08:44:08] (CR) Alexandros Kosiaris: [C: 2] "merging after having address filippo's concern" [puppet] - https://gerrit.wikimedia.org/r/291219 (owner: Alexandros Kosiaris)
[08:44:14] (PS14) Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - https://gerrit.wikimedia.org/r/291219
[08:44:18] (CR) Alexandros Kosiaris: [V: 2] network::constants Split off labs into it's own realm [puppet] - https://gerrit.wikimedia.org/r/291219 (owner: Alexandros Kosiaris)
[08:50:36] (CR) Alexandros Kosiaris: [C: 2] "https://puppet-compiler.wmflabs.org/3011/ NOOP according to PCC. Finally moving this into hiera" [puppet] - https://gerrit.wikimedia.org/r/291263 (owner: Alexandros Kosiaris)
[08:50:44] (PS6) Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - https://gerrit.wikimedia.org/r/291263
[08:50:51] (CR) Alexandros Kosiaris: [V: 2] networks::constants: Hieraize all_network_subnets [puppet] - https://gerrit.wikimedia.org/r/291263 (owner: Alexandros Kosiaris)
[09:07:47] (PS2) Alexandros Kosiaris: Assign roles::ores::web, roles::ores::worker to SCB [puppet] - https://gerrit.wikimedia.org/r/278990 (https://phabricator.wikimedia.org/T124201)
[09:09:04] (PS3) Hashar: (DO NOT SUBMIT) contint: pin chromium to 49 on Trusty [puppet] - https://gerrit.wikimedia.org/r/291116 (https://phabricator.wikimedia.org/T136188)
[09:10:18] !log depooled reboot of cp3036 (T131928)
[09:10:19] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[09:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:13:44] akosiaris: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item network::subnets in any Hiera data file and no default supplied at /etc/puppet/modules/network/manifests/constants.pp:90 on node tools-services-01.tools.eqiad.wmflabs
[09:14:02] akosiaris: on tools-services-01. I think that's related to https://gerrit.wikimedia.org/r/291263 ?
[09:15:25] (CR) Hashar: [C: -1] "It is held/pinned properly:" [puppet] - https://gerrit.wikimedia.org/r/291116 (https://phabricator.wikimedia.org/T136188) (owner: Hashar)
[09:16:09] !log depooled reboot of cp3007 (T131928)
[09:16:10] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[09:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:18:05] !log depooled reboot of cp3006 (T131928)
[09:18:06] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[09:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:21:23] !log depooled reboot of cp3010 (T131928)
[09:21:23] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[09:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:22:32] valhallasw`cloud: hmm, indeed. sigh, lemme see what I can do about that
[09:22:46] thanks! :-)
[09:23:32] !log depooled reboot of cp3045 (T131928)
[09:23:33] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[09:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:26:37] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:06] RECOVERY - Host cp3045 is UP: PING OK - Packet loss = 0%, RTA = 83.13 ms
[09:28:55] !log depooled reboot of cp3039 (T131928)
[09:28:56] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[09:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:37:26] !log installing libgd security updates
[09:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:38:52] Operations, Puppet, Beta-Cluster-Infrastructure, Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2344289 (hashar)
[09:39:40] (PS1) Alexandros Kosiaris: network::subnets: Add the hiera variable in labs.yaml as well [puppet] - https://gerrit.wikimedia.org/r/292104
[09:39:54] Operations, Labs, Labs-Infrastructure, Monitoring: Have a paging check for Nova API accessible - https://phabricator.wikimedia.org/T133656#2344291 (hashar)
[09:40:17] (CR) Alexandros Kosiaris: [C: 2 V: 2] network::subnets: Add the hiera variable in labs.yaml as well [puppet] - https://gerrit.wikimedia.org/r/292104 (owner: Alexandros Kosiaris)
[09:42:42] valhallasw`cloud: ok fixed.
[09:43:10] cool, thanks! :-)
[09:44:56] (CR) Hashar: [C: 1] "I hadn't seen it indeed :-] We have Xvfb on display port 94 for all slaves, but I guess it is not sufficient for the Android SDK. Maybe " [puppet] - https://gerrit.wikimedia.org/r/264303 (owner: Niedzielski)
[09:58:37] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[09:59:17] akosiaris, is that you^?
[10:00:07] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:02:00] ms-be2012 has a 39GB syslog, checking why
[10:02:54] it seems access log is filling up syslog ?
[10:05:27] I am moving it to another partition first to avoid / issues
[10:06:27] "network::subnets: Add the hiera variable in labs.yaml as well" is pending for 25 minutes
[10:09:40] !log depooled reboot of cp3035 (T131928)
[10:09:41] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:10:15] !log depooled reboot of cp3008 (T131928)
[10:10:16] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:11:09] RECOVERY - Disk space on ms-be2012 is OK: DISK OK
[10:11:20] !log moved syslog1 to ms-be2012:/srv/swift-storage/sdl1/tmp to avoid / fillup
[10:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:13:48] godog, can you check logging options of ms-be2012, it has an unusual large syslog. Can it be deleted?
[10:14:47] !log depooled reboot of cp3037 (T131928)
[10:14:48] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:14:54] jynus: he's on vacation for the next two weeks
[10:15:50] oh, sorry
[10:19:19] it is growing at around 1MB/s
[10:21:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:21:15] jynus: yeah fixed. I did not notice it failed
[10:22:25] actually, I saw another issue. Are you deploying now ores?
[10:22:44] I saw it failing on deployment hosts
[10:23:13] Error: /usr/bin/git -c core.sharedRepository=group clone --recurse-submodules https://gerrit.wikimedia.org/r/p/ores/deploy.git /srv/deployment/ores/deploy returned 128 instead of one of [0]
[10:24:09] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail
[10:26:08] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[10:28:29] !log depooled reboot of cp3009 (T131928)
[10:28:30] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:29:49] I do not see a difference in config between swift nodes, but this one has a lot of logging
[10:31:29] !log depooled reboot of cp3004 (T131928)
[10:31:30] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:12] jynus: ah on tin. No I am not yet, but that's the plan for today.
[10:32:47] jynus: lemme figure out why
[10:35:04] !log depooled reboot of cp3047 (T131928)
[10:35:05] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:38:44] !log depooled reboot of cp3044 (T131928)
[10:38:45] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:39:25] Operations, MediaWiki-General-or-Unknown: Database error when filtering page log - https://phabricator.wikimedia.org/T136687#2344407 (Danny_B)
[10:39:36] !log depooled reboot of cp3005 (T131928)
[10:39:37] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:40:40] (PS1) Ppchelko: Enable RESTBase on office wiki [puppet] - https://gerrit.wikimedia.org/r/292109
[10:40:51] (PS1) Alexandros Kosiaris: ores: Fix the deployment repo on deployment hosts [puppet] - https://gerrit.wikimedia.org/r/292110
[10:41:09] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3005_v4, cp3005_v6, cp3044_v4, cp3044_v6
[10:41:38] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3005_v4, cp3005_v6, cp3044_v4, cp3044_v6
[10:41:38] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp3005_v4, cp3005_v6, cp3044_v4, cp3044_v6
[10:43:29] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK
[10:43:29] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK
[10:44:59] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK
[10:45:05] (CR) Alexandros Kosiaris: [C: -1] "minor comments plus a question. Why not reuse base::service_unit (well wrap around it) for this ? the template is anyway fed into base::se" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/291949 (owner: Giuseppe Lavagetto)
[10:45:16] (CR) Alexandros Kosiaris: [C: 2] ores: Fix the deployment repo on deployment hosts [puppet] - https://gerrit.wikimedia.org/r/292110 (owner: Alexandros Kosiaris)
[10:45:55] !log depooled reboot of cp3034 (T131928)
[10:45:56] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:00] !log depooled reboot of cp3003 (T131928)
[10:47:01] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:17] !log depooled reboot of cp3046 (T131928)
[10:47:18] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928
[10:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:48] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[10:47:54] !log Script done for uca-it collation on itwiki: 10 599 758 rows processed
[10:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:54:38] (PS1) Gehel: Make sure that replicate-osm can not run in multiple instances. [puppet] - https://gerrit.wikimedia.org/r/292112
[10:54:41] Operations, Patch-For-Review: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2344426 (ema) Open>Resolved a:ema I've just finished rebooting all cp* hosts in esams (T131928) and they came back up fine without raid issues.
[10:57:09] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:58:06] (03PS1) 10Giuseppe Lavagetto: mediawiki::cgroup: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/292113 (https://phabricator.wikimedia.org/T131749) [11:05:06] (03PS2) 10Giuseppe Lavagetto: mediawiki::cgroup: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/292113 (https://phabricator.wikimedia.org/T131749) [11:05:44] !log rebooting cp3* spares (T131928) [11:05:45] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [11:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:13:19] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:15:08] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.719 second response time [11:15:58] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [11:16:29] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 83.62 ms [11:21:29] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [11:26:29] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: puppet fail [11:32:44] (03CR) 10Anomie: [C: 031] "This should do it, since we already have wmgBotPasswordsDatabase set false for private wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/292053 (https://phabricator.wikimedia.org/T135074) (owner: 10Gergő Tisza) [11:34:40] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:34:58] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2344473 (10jcrespo) I've started installation, reporting issues on T135253. [11:35:38] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2344483 (10jcrespo) [11:35:40] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and set up 16 db's db1079-1094 - https://phabricator.wikimedia.org/T135253#2293311 (10jcrespo) 05stalled>03Open a:05jcrespo>03Cmjohnson These are the issues I found: DHCP/network issues db1082 db1085 db1086 cannot connect to mgmt db1089 db1094 [11:38:28] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.566 second response time [11:38:32] I am installing 10 servers, hopefuly that will not create too much network impact [11:38:42] nah [11:38:57] I saw "Port utilisation over threshold" emails [11:47:50] (03PS2) 10Ppchelko: Enable RESTBase on office wiki [puppet] - 10https://gerrit.wikimedia.org/r/292109 (https://phabricator.wikimedia.org/T88016) [11:50:02] !log rebooting kafka1022 for kernel upgrade (4.4) [11:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:51:09] (03PS1) 10BBlack: nginx (1.11.1-1+wmf1) jessie; urgency=medium [software/nginx] (wmf-1.11.1) - 10https://gerrit.wikimedia.org/r/292117 [11:53:09] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [11:57:49] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail [12:00:28] PROBLEM - Start and verify 
pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:59] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.538 second response time [12:05:47] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2344551 (10Gehel) [12:08:08] (03CR) 10Gehel: [C: 032] Make sure that replicate-osm can not run in multiple instances. [puppet] - 10https://gerrit.wikimedia.org/r/292112 (owner: 10Gehel) [12:12:59] (03CR) 10BBlack: [C: 032 V: 032] nginx (1.11.1-1+wmf1) jessie; urgency=medium [software/nginx] (wmf-1.11.1) - 10https://gerrit.wikimedia.org/r/292117 (owner: 10BBlack) [12:22:36] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2343755 (10faidon) Since you get bad //download// speeds, the opposite traceroute (from eqiad to you) is the more interesting one. I didn't have your IP,... [12:22:40] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:26:31] !log upgrade nginx to 1.11.1-1+wmf1 on all clusters [12:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:36:07] 06Operations, 10Traffic, 07Browser-Support-Firefox, 07HTTPS: Secure connection failed when attempting to send POST request using HTTP/2 (if connection has been idle for a certain time) - https://phabricator.wikimedia.org/T134869#2344610 (10BBlack) We've just upgraded our nginx package to 1.11.1, which incl... 
[12:39:45] (03PS2) 10Muehlenhoff: Add firejail profile and wrapper for ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/291924 (https://phabricator.wikimedia.org/T135111) [12:45:44] 06Operations, 10Traffic, 07Browser-Support-Firefox, 07HTTPS: Secure connection failed when attempting to send POST request using HTTP/2 (if connection has been idle for a certain time) - https://phabricator.wikimedia.org/T134869#2344627 (10BBlack) Also note, whether or not the nginx update helps the situat... [12:49:33] !log draining cr2-codfw for firmware upgrade [12:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:11] (03PS3) 10Muehlenhoff: Add firejail profile and wrapper for ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/291924 (https://phabricator.wikimedia.org/T135111) [12:58:30] PROBLEM - Host mw2144 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:31] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [12:58:41] PROBLEM - Host mw2097 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:00] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:00] PROBLEM - Host elastic2012 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:01] PROBLEM - Host es2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:01] PROBLEM - Host es2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:01] PROBLEM - Host elastic2011 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:01] PROBLEM - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:11] PROBLEM - Host db2018 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:11] PROBLEM - Host db2017 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:11] PROBLEM - Host db2028 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:11] PROBLEM - Host db2023 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:11] PROBLEM - Host db2030 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:12] PROBLEM - Host db2029 is DOWN: PING CRITICAL - Packet loss = 100% 
[12:59:12] PROBLEM - Host db2019 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:13] PROBLEM - Host db2016 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:20] PROBLEM - Host cp2010 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:20] PROBLEM - Host restbase2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:40] PROBLEM - Host cp2012 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:47] a whole rack down? [12:59:51] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:51] PROBLEM - Host ganeti2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:51] PROBLEM - Host lvs2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:51] PROBLEM - Host es2011 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:51] PROBLEM - Host elastic2010 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:51] PROBLEM - Host elastic2008 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:51] PROBLEM - Host maps-test2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:52] PROBLEM - Host elastic2009 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:52] PROBLEM - Host elastic2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:53] PROBLEM - Host eventlog2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:53] PROBLEM - Host ms-be2006 is DOWN: PING CRITICAL - Packet loss = 100% [12:59:54] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [13:00:10] PROBLEM - Host cp2007 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:20] RECOVERY - Host restbase2002 is UP: PING WARNING - Packet loss = 50%, RTA = 36.30 ms [13:00:20] RECOVERY - Host elastic2010 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [13:00:21] PROBLEM - Host sca2002 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:21] PROBLEM - Host restbase2007 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:21] PROBLEM - Host restbase-test2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:21] PROBLEM - Host wtp2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:40] eek meta 
is down [13:01:18] yeah we probably lost networking to a whole rack or row as an indirect fallout of 12:49 < paravoid> !log draining cr2-codfw for firmware upgrade [13:01:21] paravoid: ^ [13:01:37] actually eqiad and codfw [13:01:44] eqiad? [13:01:57] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 94, down: 3, dormant: 0, excluded: 0, unused: 0; ae2: down - Core: asw-b-codfw:ae2; ae3: down - Core: asw-c-codfw:ae2; ae4: down - Core: asw-d-codfw:ae2 [13:02:01] I see only 2nnnn hostnames above [13:02:02] sorry, those are ipsec only [13:02:07] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [13:02:07] right [13:02:08] RECOVERY - Host db2016 is UP: PING OK - Packet loss = 0%, RTA = 36.42 ms [13:02:08] RECOVERY - Host wtp2010 is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [13:02:13] argh [13:02:16] RECOVERY - Host elastic2008 is UP: PING OK - Packet loss = 0%, RTA = 38.22 ms [13:02:16] RECOVERY - Host es2003 is UP: PING OK - Packet loss = 0%, RTA = 38.09 ms [13:02:17] RECOVERY - Host db2018 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [13:02:27] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms [13:02:27] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: Connection timed out [13:02:27] should we put codfw down?
[13:02:33] no [13:02:36] PROBLEM - salt-minion processes on es2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:02:36] RECOVERY - Host mw2100 is UP: PING OK - Packet loss = 0%, RTA = 36.32 ms [13:02:37] RECOVERY - Host mw2089 is UP: PING OK - Packet loss = 0%, RTA = 37.14 ms [13:02:45] (well, no if you mean in geodns) [13:02:46] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [13:02:57] RECOVERY - Host mw2108 is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms [13:03:07] RECOVERY - Host mw2111 is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [13:03:15] lvs/pybal will work around a one-rack/one-row issue in codfw for cache traffic [13:03:17] RECOVERY - Host db2023 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [13:03:17] RECOVERY - Host wtp2003 is UP: PING OK - Packet loss = 0%, RTA = 37.10 ms [13:03:26] RECOVERY - Host db2028 is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [13:03:26] RECOVERY - Host mc2011 is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms [13:03:26] RECOVERY - Host mw2139 is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [13:03:27] RECOVERY - Host mw2115 is UP: PING OK - Packet loss = 0%, RTA = 37.70 ms [13:03:27] RECOVERY - Host mw2103 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [13:03:27] RECOVERY - Host mw2131 is UP: PING OK - Packet loss = 0%, RTA = 36.93 ms [13:03:30] (or should anyways) [13:03:31] it wasn't a one rack/one row issue, though [13:03:37] RECOVERY - Host ganeti2001 is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms [13:03:40] Glaisher, are you sure meta doesn't work? 
[13:03:40] oh [13:03:41] it shouldn't have been any rack/row issue :) [13:03:46] we have two routers [13:03:47] RECOVERY - Host mw2109 is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [13:03:47] RECOVERY - Host mw2113 is UP: PING OK - Packet loss = 0%, RTA = 37.19 ms [13:03:47] RECOVERY - Host mw2136 is UP: PING OK - Packet loss = 0%, RTA = 37.16 ms [13:03:47] RECOVERY - Host es2011 is UP: PING OK - Packet loss = 0%, RTA = 38.66 ms [13:03:47] PROBLEM - puppet last run on mw2138 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:51] yes, throwing wikimedia error [13:03:57] still [13:03:57] RECOVERY - Host mw2127 is UP: PING OK - Packet loss = 0%, RTA = 37.41 ms [13:03:58] paravoid: I guess I missed that icinga quit on us for a bit [13:04:07] RECOVERY - Host ganeti2005 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [13:04:17] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:26] RECOVERY - Host sca2002 is UP: PING OK - Packet loss = 0%, RTA = 43.93 ms [13:04:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [13:04:31] Glaisher, which continent are you on? [13:04:37] RECOVERY - Host mw2110 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [13:04:45] he's hitting codfw edge almost certainly, for whatever reason [13:04:47] RECOVERY - Host db2017 is UP: PING OK - Packet loss = 0%, RTA = 37.56 ms [13:04:47] RECOVERY - Host mw2120 is UP: PING OK - Packet loss = 0%, RTA = 37.33 ms [13:04:47] asia [13:04:49] or ulsfo [13:04:54] well yeah, if it's misses [13:05:07] seems to be back now, no? 
[13:05:17] RECOVERY - Host mw2119 is UP: PING OK - Packet loss = 0%, RTA = 37.87 ms [13:05:22] still no idea why this happened [13:05:37] RECOVERY - Host ms-fe2003 is UP: PING OK - Packet loss = 0%, RTA = 37.24 ms [13:05:38] ganeti down, and there are something there with no redundancy [13:05:49] nothing is down [13:05:50] including restbase [13:05:53] getting continuous error ::: Our servers are currently under maintenance or experiencing a technical problem. This is probably temporary and should be fixed soon [13:05:56] RECOVERY - Host ms-be2014 is UP: PING OK - Packet loss = 0%, RTA = 36.44 ms [13:05:58] been a few minutes [13:06:06] still getting errors? [13:06:06] RECOVERY - Host mw2106 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [13:06:06] RECOVERY - Host mw2126 is UP: PING OK - Packet loss = 0%, RTA = 38.56 ms [13:06:10] https://en.wikisource.org/wiki/The_Rebellion_in_the_Cevennes [13:06:17] yeah still 503, just checked [13:06:18] RECOVERY - Host restbase-test2002 is UP: PING OK - Packet loss = 0%, RTA = 37.95 ms [13:06:26] RECOVERY - Host mw2129 is UP: PING OK - Packet loss = 16%, RTA = 36.38 ms [13:06:26] RECOVERY - Host ms-be2018 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms [13:06:31] paravoid: yep, still [13:06:36] I'm going to revert all of my changes, but I still don't understand why [13:06:44] pings from bast4001 to text-lb/upload-lb.codfw work [13:06:46] RECOVERY - Host mw2099 is UP: PING OK - Packet loss = 0%, RTA = 37.64 ms [13:06:46] RECOVERY - Host rdb2003 is UP: PING OK - Packet loss = 0%, RTA = 37.12 ms [13:06:47] RECOVERY - Host mw2147 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [13:06:53] GETs as well [13:06:55] I'm likely to be Asia [13:06:56] RECOVERY - Host mw2086 is UP: PING WARNING - Packet loss = 73%, RTA = 37.51 ms [13:07:07] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 37.03 ms [13:07:08] anyways, depooling codfw front edge first [13:07:17] RECOVERY - Host mw2088 is UP: PING OK - Packet loss = 0%, 
RTA = 36.22 ms [13:07:27] RECOVERY - Host mw2085 is UP: PING OK - Packet loss = 0%, RTA = 38.84 ms [13:07:27] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 3.045 second response time on port 9042 [13:07:37] RECOVERY - Host elastic2012 is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [13:07:37] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [13:07:37] RECOVERY - Host mc2007 is UP: PING OK - Packet loss = 0%, RTA = 37.19 ms [13:07:37] RECOVERY - Host elastic2007 is UP: PING OK - Packet loss = 0%, RTA = 37.26 ms [13:07:37] RECOVERY - Host graphite2001 is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [13:07:42] is it any better now? [13:07:44] (03PS1) 10BBlack: depool codfw [dns] - 10https://gerrit.wikimedia.org/r/292127 [13:07:52] yep, thanks [13:07:52] paravoid: yes, at least for me [13:07:57] bblack: hold that [13:08:01] yes [13:08:04] yeah it works now [13:08:10] I'm getting served by ulsfo, not codfw [13:08:11] what the hell.. [13:08:17] back! [13:08:17] I was testing through codfw with a cache miss direct to codfw frontends [13:08:30] from where? 
[13:08:32] Glaisher: ulsfo uses codfw on cache miss [13:08:41] ah [13:08:45] this is strange [13:08:51] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:09:45] 07Puppet, 10Phragile, 06TCB-Team, 07Composer, and 2 others: Puppet fail due to composer install on Phragile instance - https://phabricator.wikimedia.org/T133967#2344656 (10Tobi_WMDE_SW) [13:09:51] FTR: the way to test a cache miss through a specific frontend DC is like: [13:10:07] curl -v https://en.wikipedia.org/wiki/Main_Page?asdfasdfxx --resolve en.wikipedia.org:443:208.80.153.224 [13:10:21] but replace asdfasdfxx with something unique from you, to make sure it's not cached with anyone else [13:10:37] and the IP at the end is what "host text-lb.codfw.wikimedia.org" gives [13:10:54] so throughout this issue, I could reach text-lb/upload-lb codfw from bast4001 [13:10:56] we actually got varnish answering 5xx ? [13:11:10] so why did it actually fail for those users, I wonder [13:11:15] yes, I saw 503's via that curl command multiple times initially [13:11:31] but when I re-checked after uploading the DNS patch, it was fixed [13:11:37] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: puppet fail [13:11:39] it wasn't a question but a statement, asking why [13:11:46] RECOVERY - salt-minion processes on es2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:11:47] PROBLEM - puppet last run on mw2088 is CRITICAL: CRITICAL: puppet fail [13:11:47] (rethoric question) [13:11:47] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: puppet fail [13:11:50] yeah, because I reverted the OSPF/uplink changes on cr2 in the meantime [13:11:56] PROBLEM - Redis status tcp_6379 on rdb2004 is CRITICAL: Return code of 255 is out of bounds [13:12:16] PROBLEM - Redis status tcp_6380 on rdb2004 is CRITICAL: Return code of 255 is out of bounds [13:12:17] PROBLEM - Redis status tcp_6381 
on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.123 on port 6381 [13:12:26] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: puppet fail [13:12:28] PROBLEM - puppet last run on mc2008 is CRITICAL: CRITICAL: puppet fail [13:12:31] paravoid: I assume the via-ulsfo failures were on requests that were full miss/pass [13:12:43] or at least, miss/pass in both layers at ulsfo [13:12:46] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: puppet fail [13:12:52] sure, but *why*? codfw was reachable [13:13:04] hm, I didn't try cpNNNN [13:13:13] stupidly enough [13:13:14] yeah but codfw couldn't reach eqiad mediawiki [13:13:24] well, couldn't reach eqiad caches, actually [13:13:38] RECOVERY - Redis status tcp_6379 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6379 has 1 databases (db0) with 10646693 keys - replication_delay is 0 [13:13:57] RECOVERY - Redis status tcp_6380 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6380 has 1 databases (db0) with 10650623 keys - replication_delay is 0 [13:14:12] acoording to 5xx, incident started at 13:01 and ended at 13:12 [13:14:16] RECOVERY - Redis status tcp_6381 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6381 has 1 databases (db0) with 10624848 keys - replication_delay is 0 [13:14:21] hmmm, maybe loading the firmware package actually reset FPC 5 [13:14:23] that would explain a lot [13:14:47] contrary to documentation and common sense... [13:15:23] total requests count doesn't load to me, to see if there was a slowdown of 200 [13:15:27] [01/Jun/2016:13:01:17 +0000] "GET /junos/jfirmware-13.3R8.7-signed.tgz HTTP/1.1" [13:16:19] and just to circle back to earlier things: yeah it wasn't rack/row-specific, that was my initial assumption from the short list before icinga-wm dropped. the codfw caches are spread evenly across all the rows, for every cluster, so if they fail totally, it's all. 
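The cache-miss check described at 13:09:51 to 13:10:37 can be scripted so the unique query string is generated rather than typed. This is a minimal Python sketch, not the tooling used during the incident: the helper name `cache_miss_curl_command` is hypothetical, curl is assumed to be installed, and the IP is the one quoted in the log for text-lb.codfw.wikimedia.org, which should be re-checked with `host` before use.

```python
import shlex
import uuid


def cache_miss_curl_command(host, path, frontend_ip, token=None):
    """Build the curl argv for a cache-miss request pinned to one frontend.

    A unique query string makes it unlikely that any cache layer already
    holds the object, and --resolve makes curl connect to frontend_ip
    instead of whatever address DNS would return for host.
    """
    token = token or uuid.uuid4().hex  # random unless the caller supplies one
    url = f"https://{host}{path}?{token}"
    return ["curl", "-v", url, "--resolve", f"{host}:443:{frontend_ip}"]


if __name__ == "__main__":
    # Values taken from the log; the IP is an assumption that may be stale.
    argv = cache_miss_curl_command(
        "en.wikipedia.org", "/wiki/Main_Page", "208.80.153.224"
    )
    print(shlex.join(argv))  # paste into a shell, or pass argv to subprocess.run
```

Building the command as an argument list also avoids the shell interpreting the `?` in the URL, which the hand-typed version leaves unquoted.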
[13:16:30] bblack, same for me [13:16:53] because I saw at first like a couple of dbs, then a couple of cps, etc [13:17:21] so my commit of disabling the uplinks was at ~12:57 [13:17:38] PROBLEM - puppet last run on planet2001 is CRITICAL: CRITICAL: puppet fail [13:17:54] which also reset the pybal BGP sessions a few seconds later [13:17:57] I am waiting for logs from cps to arrive [13:18:06] (not logs, metrics) [13:18:22] 15:58 < icinga-wm> PROBLEM - Host mw2144 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:30] this is when the first downtime started [13:18:43] so the uplink part did it, not jfirmware [13:18:47] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:19:46] cp3 and cp4 got to 0 traffic (but I do not know if it is the metrics failing, too) [13:19:57] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:20:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:20:46] is it possible that the link-down isolated ulsfo+codfw from eqiad but not each other? [13:21:13] well ulsfo I guess not, since it wasn't in the icinga spam [13:21:41] but maybe it only broke codfw<->eqiad, and didn't isolate codfw from ulsfo [13:22:31] also kind of worrying that we got no pages [13:22:48] I didn't either and I'm on 24/7 [13:22:49] I was just writing the same thing [13:23:24] I think the status thing may send something? 
[13:23:34] no text-lb/upload-lb alerts were issued [13:23:42] (or other -lb, obviously) [13:28:07] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:28:07] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail [13:28:48] [2016-06-01 13:04:56] SERVICE ALERT: text-lb.codfw.wikimedia.org;LVS HTTPS IPv4;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 15599 bytes in 0.318 second response time [13:28:57] [2016-06-01 13:03:08] SERVICE ALERT: text-lb.codfw.wikimedia.org;LVS HTTPS IPv4;CRITICAL;SOFT;1;HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 2319 bytes in 0.168 second response time [13:29:13] icinga's logs says it did alert [13:29:23] well, logged an alert, but a soft one [13:29:28] RECOVERY - puppet last run on mw2138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:29:31] I guess it never got to 3/3? [13:29:38] RECOVERY - puppet last run on planet2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:29:47] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:30:27] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:30:37] RECOVERY - puppet last run on mc2008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:32:06] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:32:41] (03CR) 10Giuseppe Lavagetto: [C: 031] "The change seems ok, did you check the production puppetmasters so that the auth.conf file we're writing is equivalent to what is now live" [puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [13:34:56] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:35:47] RECOVERY - puppet last run 
on ms-be2018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [13:43:59] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Increase time before alter for elasticsearch disk space issues - https://phabricator.wikimedia.org/T136702#2344706 (10Gehel) [13:44:52] (03PS8) 10Gehel: Increase time before alter for elasticsearch disk space issues [puppet] - 10https://gerrit.wikimedia.org/r/290487 (https://phabricator.wikimedia.org/T136702) [13:45:01] (03PS1) 10BBlack: route around codfw in cache::route_table where possible [puppet] - 10https://gerrit.wikimedia.org/r/292133 [13:48:53] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 2 others: Increase time before alter for elasticsearch disk space issues - https://phabricator.wikimedia.org/T136702#2344728 (10Gehel) Current concerns from @faidon are: * the nrpe module is not the right place to implement a disk... [13:49:05] 06Operations, 10Monitoring, 10Traffic: Add LVS public endpoint checks that bypass caches - https://phabricator.wikimedia.org/T136703#2344731 (10BBlack) [13:50:20] 06Operations, 10Monitoring, 10Traffic: Add LVS public endpoint checks that bypass caches - https://phabricator.wikimedia.org/T136703#2344744 (10BBlack) p:05Triage>03High [13:51:09] (03CR) 10BBlack: [C: 032] depool codfw [dns] - 10https://gerrit.wikimedia.org/r/292127 (owner: 10BBlack) [13:51:18] (03CR) 10BBlack: [C: 032] route around codfw in cache::route_table where possible [puppet] - 10https://gerrit.wikimedia.org/r/292133 (owner: 10BBlack) [13:51:42] (03PS2) 10KartikMistry: Beta: Enable Compact Language Links for new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291908 (https://phabricator.wikimedia.org/T136161) [13:54:26] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:55:21] (03PS1) 10Giuseppe Lavagetto: apache::mod_conf: jessie compatibility [puppet] 
- 10https://gerrit.wikimedia.org/r/292135 (https://phabricator.wikimedia.org/T131749) [14:00:57] (03PS4) 10Muehlenhoff: Add firejail profile and wrapper for ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/291924 (https://phabricator.wikimedia.org/T135111) [14:01:25] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add firejail profile and wrapper for ghostscript [puppet] - 10https://gerrit.wikimedia.org/r/291924 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [14:02:05] (03PS5) 10Gehel: Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) [14:03:08] (03PS6) 10Gehel: Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) [14:03:10] (03CR) 10jenkins-bot: [V: 04-1] Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: 10Gehel) [14:05:26] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2344776 (10brion) Thanks, I'll keep an eye out tonight and see if it gets congested again (currently seeing a cool 80 Mbits download rate at 7:04am pacifi... [14:05:47] anyone have any suggestions for tracking down disk reads; tracking down what is being read on disk? [14:06:03] (03PS7) 10Gehel: Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) [14:06:36] i.e. assuming you had high read throughput, and you knew the process, how would you determine *what* it was reading? 
[14:06:54] (assuming the process was uncooperative on the matter) [14:07:40] (03PS1) 10Giuseppe Lavagetto: hhvm::debug: jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/292140 (https://phabricator.wikimedia.org/T131749) [14:08:41] !log depooled reboot of cp1* hosts (T131928) [14:08:42] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [14:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:42] urandom: as in which file or what part of the file? [14:12:53] lsof will list paths of open file descriptors [14:13:22] can always strace for fun i guess \o/ [14:13:51] brion: thing is, there are lots of open files, which is normal, so lots of noise and no signal [14:14:02] ah fun times [14:14:10] brion: and strace probably will work, but it's java, so... yeah [14:14:17] could strace it, grep for reads, then check which fd is the most ..... sounds painful though [14:14:31] would be nice if there was a tool that said 'reading 50 mbits from this file' [14:14:42] brion: that's the tool i'm looking for! [14:14:46] hehe [14:15:07] urandom: is this about what I saw last night with ganglia reporting more read traffic than cassandra metrics? first guess would be "something other than cassandra is also reading" [14:15:26] !log Disabling OSPF on all cr2-codfw row subnets to drain FPC0 [14:15:29] heads-up! 
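For the "which file is this process reading" question above, one rough approach on Linux is to sample the per-descriptor offsets in /proc/<pid>/fdinfo twice and rank the descriptors whose offset advanced. The sketch below is only an illustration under several assumptions: the function names are hypothetical, it needs permission to read the target's /proc entries, the offset also moves on writes, and it does not move for pread() or mmap() access, so a Java process such as Cassandra may show little or nothing this way.

```python
import os
import sys
import time


def parse_pos(fdinfo_text):
    """Return the file offset from the text of /proc/<pid>/fdinfo/<fd>, or None."""
    for line in fdinfo_text.splitlines():
        if line.startswith("pos:"):
            return int(line.split()[1])
    return None


def snapshot(pid):
    """Map fd -> (target path, offset) for every fd of pid that can be read."""
    result = {}
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(f"{fd_dir}/{fd}")
            with open(f"/proc/{pid}/fdinfo/{fd}") as handle:
                pos = parse_pos(handle.read())
        except OSError:
            continue  # the fd was closed between listing and reading
        if pos is not None:
            result[fd] = (target, pos)
    return result


def rank_by_growth(before, after):
    """Return (bytes advanced, path) pairs, largest first, for fds in both snapshots."""
    growth = []
    for fd, (target, pos) in after.items():
        if fd in before and before[fd][0] == target and pos > before[fd][1]:
            growth.append((pos - before[fd][1], target))
    return sorted(growth, reverse=True)


if __name__ == "__main__":
    pid = int(sys.argv[1])
    first = snapshot(pid)
    time.sleep(5)  # sampling window; an arbitrary choice for this sketch
    for advanced, path in rank_by_growth(first, snapshot(pid)):
        print(f"{advanced:>12} bytes  {path}")
```

This trades accuracy for low overhead compared with strace: it only sees sequential offset movement between the two samples.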
[14:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:06] brion: no, that turned out to be my usage of iostat, i think [14:16:14] ok [14:16:17] (03PS2) 10Giuseppe Lavagetto: apache::mod_conf: jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/292135 (https://phabricator.wikimedia.org/T131749) [14:21:37] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/3018/ shows a noop, hopefully :)" [puppet] - 10https://gerrit.wikimedia.org/r/292135 (https://phabricator.wikimedia.org/T131749) (owner: 10Giuseppe Lavagetto) [14:29:22] !log Disabling cr2-codfw et-0/0/1 (row B uplink) [14:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:30] !log Disabling cr2-codfw et-0/0/0 (row A uplink) [14:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:38] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 100, down: 2, dormant: 0, excluded: 0, unused: 0; ae1: down - Core: asw-a-codfw:ae2; ae2: down - Core: asw-b-codfw:ae2 [14:34:49] took 5 whole minutes :/ [14:35:17] (03PS8) 10Gehel: Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) [14:35:30] it took 5 minutes to complete downing one row uplink port? [14:35:34] no [14:35:36] for icinga to notice [14:35:57] between the row B uplink being disabled and the icinga alert [14:36:09] because icinga doesn't alert until N fails in a row [14:36:16] we can tune that if we want it faster and flappier [14:36:39] I'm not sure if that was the reason here, icinga is probably backlogged [14:37:57] [2016-06-01 14:31:48] SERVICE ALERT: cr2-codfw;Router interfaces;CRITICAL;SOFT;1;CRITICAL: host '208.80.153.193', interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0
ae2: down -> Core: asw-b-codfw:ae2
[14:38:01] [2016-06-01 14:32:57] SERVICE ALERT: cr2-codfw;Router interfaces;CRITICAL;SOFT;2;CRITICAL: host '208.80.153.193', interfaces up: 110, down: 1, dormant: 0, excluded: 0, unused: 0
ae2: down -> Core: asw-b-codfw:ae2
[14:38:05] [2016-06-01 14:34:38] SERVICE ALERT: cr2-codfw;Router interfaces;CRITICAL;HARD;3;CRITICAL: host '208.80.153.193', interfaces up: 100, down: 2, dormant: 0, excluded: 0, unused: 0
ae1: down -> Core: asw-a-codfw:ae2
ae2: down -> Core: asw-b-codfw:ae2
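Note on the three SERVICE ALERT lines pasted above: the following is a rough sketch, not anything that was run in the channel. It measures how late each Icinga re-check ran, assuming the one-minute retry interval and three attempts mentioned in the discussion. The function names are made up for illustration, and `expected_time_to_alert` further assumes the failure lands at a uniformly random point inside a check interval with no scheduler backlog.

```python
from datetime import datetime

# Timestamps and state types copied from the three SERVICE ALERT lines above.
ALERTS = [
    ("2016-06-01 14:31:48", "SOFT", 1),
    ("2016-06-01 14:32:57", "SOFT", 2),
    ("2016-06-01 14:34:38", "HARD", 3),
]


def retry_lags(alerts, interval_s=60):
    """Seconds each re-check ran late relative to the configured interval."""
    times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts, _, _ in alerts]
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    return [int(gap - interval_s) for gap in gaps]


def expected_time_to_alert(interval_s, attempts):
    """Mean seconds from failure to HARD state (assumption: the failure is
    uniformly distributed within one interval and checks run on time)."""
    return (attempts - 0.5) * interval_s


print(retry_lags(ALERTS))             # lag of the 2nd and 3rd checks, in seconds
print(expected_time_to_alert(60, 3))  # current settings as described in the channel
print(expected_time_to_alert(30, 2))  # the faster setting floated in the channel
```

With the timestamps above, the second and third checks come out 9 and 41 seconds late, and the on-time expectation is 150 seconds for 3 checks at 60 seconds versus 45 seconds for 2 checks at 30 seconds.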
[14:38:16] it first noticed at 14:31, but didn't alert until 3 hits at 14:34 [14:38:48] apergos: yt? [14:39:02] ottomata: yep. what's up? [14:39:10] hiya, the key you have in pwstore is expired :/ [14:39:22] argh [14:39:26] sorry [14:39:30] hehe, np [14:39:30] :) [14:40:11] moritzm: can advise, but i think you need to somehow get a new key up with the same id? or extend expiry? not sure how that works [14:41:13] bblack: yeah, but these are supposed to be with a 1-minute interval [14:41:23] 3 times, 1 minute between each [14:41:45] so avg time to alert should be 2.5 mins, and backlog made it more like 5 [14:41:48] so you see how it lags behind 9 seconds the second time, then 41 seconds the third time [14:42:03] yup [14:42:48] !log Disabling cr2-codfw et-0/2/0, et-0/2/1 (row C/D uplinks) [14:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:12] well icinga-backlog issues aside, maybe for routers we should still do it faster. maybe drop it to 30s interval and 2x fail to alert? [14:43:48] ok, so we're not at where we were before [14:43:51] only no outage this time [14:44:03] so probably just timing issues/races :/ [14:44:18] possibly for ulsfo specifically, as the MX80s are particularly slow [14:44:57] (03PS2) 10Giuseppe Lavagetto: hhvm::debug: jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/292140 (https://phabricator.wikimedia.org/T131749) [14:47:00] ottomata: I've renewed it on the keyservers [14:47:28] k apergos thanks trying [14:47:36] (03CR) 10Andrew Bogott: "Yep, the template in the patch (auth-prod-master.conf.erb) is based off of the file that's live on palladium." 
[puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [14:48:03] !log Upgrading cr2-codfw FPC 0 all PICs firmware [14:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:36] !log Rebooting cr2-codfw FPC 0 [14:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hhvm::debug: jessie compatibility [puppet] - 10https://gerrit.wikimedia.org/r/292140 (https://phabricator.wikimedia.org/T131749) (owner: 10Giuseppe Lavagetto) [14:52:54] (03PS1) 10Aklapper: Adjust "monthly Phab community metrics" SQL query to upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/292149 (https://phabricator.wikimedia.org/T136667) [14:54:11] (03CR) 10Aklapper: [C: 04-1] Adjust "monthly Phab community metrics" SQL query to upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/292149 (https://phabricator.wikimedia.org/T136667) (owner: 10Aklapper) [14:54:19] (03PS9) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/284103 [14:54:20] reboot done [14:54:24] _joe_: https://gerrit.wikimedia.org/r/#/c/292140/2/modules/hhvm/manifests/debug.pp you added $perftools_package, but you don't use it [14:54:36] (03PS1) 10Giuseppe Lavagetto: hhvm::debug: brown-paper-bag followup to I383bd23196 [puppet] - 10https://gerrit.wikimedia.org/r/292150 [14:54:51] !log Re-enabling cr2-codfw et-0/* interfaces [14:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:00] oh, you created a patchset for that now, cool [14:55:07] <_joe_> :) [14:55:17] <_joe_> SPF|Cloud: yeah pretty embarassing :P [14:55:22] hehe [14:55:34] (03PS2) 10Aklapper: Adjust "monthly Phab community metrics" SQL query to upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/292149 (https://phabricator.wikimedia.org/T136667) [14:55:47] (03CR) 10Giuseppe 
Lavagetto: [C: 032] hhvm::debug: brown-paper-bag followup to I383bd23196 [puppet] - 10https://gerrit.wikimedia.org/r/292150 (owner: 10Giuseppe Lavagetto) [14:55:47] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:55:58] (03CR) 10Giuseppe Lavagetto: [V: 032] hhvm::debug: brown-paper-bag followup to I383bd23196 [puppet] - 10https://gerrit.wikimedia.org/r/292150 (owner: 10Giuseppe Lavagetto) [14:56:21] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2344871 (10Gehel) Error seems related to the new elasticsearch security manager. It is now more st... [14:56:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [14:57:02] !log Re-enabling OSPF on all cr2-codfw row subnets [14:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:15] !log depooled reboot of cp3048 (T131928) [14:57:15] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [14:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:47] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [14:58:53] (03CR) 10Andrew Bogott: [C: 032] Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [14:59:05] !log Restoring VRRP priority on cr2-codfw [14:59:10] (03PS10) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/284103 [14:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:33] ok, 
cr2-codfw is VRRP master now [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160601T1500). Please do the needful. [15:00:04] Lokal_Profil matej_suchanek kart_ Dereckson tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:19] Hello. [15:00:32] o/ [15:00:36] me around [15:00:42] The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later. behhh no :( [15:00:50] SPF|Cloud: where? [15:00:50] * kart_ around [15:00:54] I can SWAT, lots of things... [15:00:55] Gerrit [15:01:18] oh good. Yeah I see the same thing :( [15:01:20] thcipriani: so no you can't SWAT [15:01:23] gerrit issues? [15:01:30] looks like it [15:01:30] getting a 503 on gerrit [15:01:30] that's me [15:01:34] oh [15:01:35] ignore it please [15:01:49] thcipriani: https://gerrit.wikimedia.org/r/#/c/292103/ is already merged, but not deployed [15:01:59] * thcipriani looks [15:02:02] !log restarted gerrit to enforce 100m maxObjectSizeLimit [15:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:12] thcipriani: Gerrit up again [15:02:16] sorry about that, my muscle memory is from the time when submodule updates had to be hand-prepared [15:02:18] Back to business. 
[15:02:33] tgr: np, looks like that one requires a full scap [15:03:01] !log Disabling OSPF on all cr1-codfw row subnets to drain FPC0 [15:03:01] !log restarted grrrit-wm after gerrit restart [15:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:17] PROBLEM - NTP on cp3048 is CRITICAL: NTP CRITICAL: Offset unknown [15:03:23] tgr: so I may save that one until the end if it's alright with you. [15:03:37] sure [15:03:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [15:04:15] (03Merged) 10jenkins-bot: Make SUL icons square and use global defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288582 (https://phabricator.wikimedia.org/T135212) (owner: 10Lokal Profil) [15:04:18] (03PS3) 10Thcipriani: Remove no longer used Echo configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291218 (https://phabricator.wikimedia.org/T58037) (owner: 10Matěj Suchánek) [15:05:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291218 (https://phabricator.wikimedia.org/T58037) (owner: 10Matěj Suchánek) [15:05:16] RECOVERY - NTP on cp3048 is OK: NTP OK: Offset -0.07661449909 secs [15:05:46] (03Merged) 10jenkins-bot: Remove no longer used Echo configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291218 (https://phabricator.wikimedia.org/T58037) (owner: 10Matěj Suchánek) [15:07:20] !log Disabling cr1-codfw et-0/* (all row uplinks) [15:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:03] !log thcipriani@tin Synchronized static/images/sul: SWAT: [[gerrit:288582|Make SUL icons square and use global defaults]] (duration: 00m 41s) [15:08:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:45] Mine looks good =) [15:09:02] !log Upgrading cr1-codfw FPC 0 all PICs firmware [15:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:19] Lokal_Profil: awesome, thanks for checking :) [15:11:54] Anything more needed from me? [15:12:00] (03PS2) 10Thcipriani: Revert "Enable RC patrol on ta.wikiquote" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291237 (https://phabricator.wikimedia.org/T132868) (owner: 10Dereckson) [15:12:09] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291237 (https://phabricator.wikimedia.org/T132868) (owner: 10Dereckson) [15:12:28] Lokal_Profil: if you've checked your changes and all is well, that's it :) [15:12:57] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:13:04] Thanks for taking care of your changes and testing Lokal_Profil :) [15:13:08] (03Merged) 10jenkins-bot: Revert "Enable RC patrol on ta.wikiquote" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291237 (https://phabricator.wikimedia.org/T132868) (owner: 10Dereckson) [15:13:26] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 84, down: 4, dormant: 0, excluded: 0, unused: 0; ae1: down - Core: asw-a-codfw:ae1; ae2: down - Core: asw-b-codfw:ae1; ae3: down - Core: asw-c-codfw:ae1; ae4: down - Core: asw-d-codfw:ae1 [15:13:39] !log Rebooting cr1-codfw FPC 0 [15:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:57] No worries. Thanks for the SWAT reminder.
[15:14:12] 06Operations, 10ops-codfw: Faulty RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2344954 (10Papaul) we are decommissioning es2005-es2010 and i checked both es2005 and mc2001 uses the same type of memory so I can pull a memory stick from es2005 and put in mc2001. @MoritzMuehlenhoff please let me k... [15:14:28] 06Operations, 10ops-codfw: Faulty RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2344955 (10Papaul) p:05Triage>03Normal [15:14:48] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:291218|Remove no longer used Echo configuration]] PART I (duration: 00m 33s) [15:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:56] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.171 second response time [15:15:21] (03PS1) 10Alexandros Kosiaris: gerrit: Enforce a 100MB limit on objects instead [puppet] - 10https://gerrit.wikimedia.org/r/292154 [15:15:24] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:291218|Remove no longer used Echo configuration]] PART II (duration: 00m 26s) [15:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:51] ^ matej_suchanek changes sync'd, no log errors, so I suppose that means all is well :) [15:16:35] yeah, it should be, it was just unused configuration which was copied to local interface messages by a GS [15:17:45] matej_suchanek: cool, thank you for the patch! 
[15:18:13] (03PS3) 10Giuseppe Lavagetto: mediawiki::cgroup: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/292113 (https://phabricator.wikimedia.org/T131749) [15:18:33] you're welcome, see you soon :) [15:18:39] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:291237|Revert "Enable RC patrol on ta.wikiquote"]] (duration: 00m 25s) [15:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:46] ^ Dereckson check please [15:18:51] !log Re-enabling cr1-codfw et-0/* interfaces [15:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:17] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [15:19:28] thcipriani: RC special page looks good to me [15:19:42] !log Re-enabling OSPF on all cr1-codfw row subnets [15:19:46] Dereckson: great, thanks for checking. [15:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:12] (03PS2) 10Thcipriani: Enable bot passwords on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292053 (https://phabricator.wikimedia.org/T135074) (owner: 10Gergő Tisza) [15:20:17] yw [15:21:30] (03PS1) 10Faidon Liambotis: Revert "route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/292156 [15:21:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292053 (https://phabricator.wikimedia.org/T135074) (owner: 10Gergő Tisza) [15:21:34] (03PS1) 10Faidon Liambotis: Revert "depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/292157 [15:22:04] (03PS2) 10Faidon Liambotis: Revert "route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/292156 [15:22:33] (03Merged) 10jenkins-bot: Enable bot passwords on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292053 
(https://phabricator.wikimedia.org/T135074) (owner: 10Gergő Tisza) [15:22:56] (03CR) 10Faidon Liambotis: [C: 032] Revert "depool codfw" [dns] - 10https://gerrit.wikimedia.org/r/292157 (owner: 10Faidon Liambotis) [15:23:13] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/292156 (owner: 10Faidon Liambotis) [15:24:55] (03PS2) 10Thcipriani: Remove centralauth-autoaccount right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291720 (owner: 10Gergő Tisza) [15:25:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291720 (owner: 10Gergő Tisza) [15:25:39] (03Merged) 10jenkins-bot: Remove centralauth-autoaccount right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291720 (owner: 10Gergő Tisza) [15:26:01] (03PS2) 10Kaldari: Set $wgSpamBlacklistEventLogging to true on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290370 [15:26:03] (03PS1) 10Kaldari: Test PageAssessments extension on Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292158 (https://phabricator.wikimedia.org/T125551) [15:26:38] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:292053|Enable bot passwords on zerowiki]] (duration: 00m 24s) [15:26:43] ^ tgr check please [15:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:31] kart_: I'm circling back around to yours, getting all the config changes out first, FYI :) [15:27:48] thcipriani: np [15:28:08] (03PS1) 10Jcrespo: s/wment/wmnet/ for db1080 [dns] - 10https://gerrit.wikimedia.org/r/292159 (https://phabricator.wikimedia.org/T135253) [15:28:28] thcipriani: works [15:29:48] tgr: cool, thanks [15:29:48] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:291720|Remove centralauth-autoaccount right]] (duration: 00m 25s) [15:29:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:30] thcipriani: that works too [15:32:58] PROBLEM - NTP on cp1067 is CRITICAL: NTP CRITICAL: Offset unknown [15:33:01] 06Operations, 10netops: Upgrade cr1-codfw/cr2-codfw FPC 0 firmware - https://phabricator.wikimedia.org/T136707#2345005 (10faidon) [15:33:14] !log thcipriani@tin Synchronized php-1.28.0-wmf.4/resources/src/moment-locale-overrides.js: SWAT: [[gerrit:292087|Avoid passing integers to mw.RegExp.escape]] (duration: 00m 24s) [15:33:20] ^ kart_ check please [15:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:29] PROBLEM - NTP on cp1071 is CRITICAL: NTP CRITICAL: Offset unknown [15:33:51] thcipriani: tested. Thanks! [15:33:59] kart_: kk, thank you [15:34:03] 06Operations, 10netops: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2282892 (10faidon) 05Open>03Resolved The errors disappeared, so case 2016-0510-0764 was closed in the meantime. Before I could restore the VRRP priorities, a firmware upgrade needed to happen (cf... [15:34:08] ok, starting full scap shortly [15:34:09] PROBLEM - NTP on cp1066 is CRITICAL: NTP CRITICAL: Offset unknown [15:35:16] 06Operations, 10Traffic, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2345025 (10BBlack) Ok so, yeah, the repro is a little complicated, but this presumably isn't a varnish fault since hhvm's output is obvious... 
[15:35:40] (03PS1) 10Giuseppe Lavagetto: hhvm::debug: fix hhvm-dump-debug on jessie [puppet] - 10https://gerrit.wikimedia.org/r/292160 [15:35:49] !log thcipriani@tin Started scap: SWAT: [[gerrit:292103|Update for AuthManager]] [15:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:08] RECOVERY - NTP on cp1066 is OK: NTP OK: Offset -0.03277683258 secs [15:36:29] RECOVERY - NTP on cp1067 is OK: NTP OK: Offset -0.02504134178 secs [15:38:28] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2345034 (10Gehel) This should be fixed by adding permissions like: ``` grant codeBase "file:/usr/... [15:38:29] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.001 second response time on port 9042 [15:40:05] (03CR) 10Alexandros Kosiaris: "already enforced on ytterbium, to get me unblocked on pushing some ores repos to gerrit. 
Question is just if we want to make it permanent" [puppet] - 10https://gerrit.wikimedia.org/r/292154 (owner: 10Alexandros Kosiaris) [15:42:04] (03CR) 10Jforrester: [C: 031] Follow-up abd14947: Use HTTPS URL to citoid instead of protocol-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292068 (https://phabricator.wikimedia.org/T136423) (owner: 10Alex Monk) [15:42:28] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345045 (10RobH) [15:42:58] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:08] (03CR) 10Alexandros Kosiaris: [C: 031] Ferm rules for puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/283174 (owner: 10Muehlenhoff) [15:44:19] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345064 (10RobH) [15:44:49] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms [15:48:33] !log Setting trace probability to 5% on restbase1007-a.eqiad.wmnet : T126629 [15:48:33] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:08] RECOVERY - NTP on cp1071 is OK: NTP OK: Offset -0.0004898309708 secs [15:50:50] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:51:07] (03PS1) 10Andrew Bogott: Puppet auth.conf: Use hostname rather than IP [puppet] - 10https://gerrit.wikimedia.org/r/292161 [15:52:49] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.663 second response time [15:53:03] (03CR) 10Andrew Bogott: [C: 032] Puppet auth.conf: Use 
hostname rather than IP [puppet] - 10https://gerrit.wikimedia.org/r/292161 (owner: 10Andrew Bogott) [15:54:26] (03PS2) 10Jcrespo: s/wment/wmnet/ for db1080 [dns] - 10https://gerrit.wikimedia.org/r/292159 (https://phabricator.wikimedia.org/T135253) [15:55:00] (03CR) 10Jcrespo: [C: 032] s/wment/wmnet/ for db1080 [dns] - 10https://gerrit.wikimedia.org/r/292159 (https://phabricator.wikimedia.org/T135253) (owner: 10Jcrespo) [15:58:06] !log Disabling trace probability on restbase1007-a.eqiad.wmnet : T126629 [15:58:07] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:24] !log updating dns entry for db1080.eqiad.wment [15:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:54] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345159 (10RobH) [15:58:56] !log Setting trace probability on restbase1008-a.eqiad.wmnet to 5% : T126629 [15:58:57] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [15:58:57] (03CR) 10Chad: [C: 031] "Fine by me. This is just to prevent a DOS vector of people pushing unusably large objects. 
If there's individual repos we want to make the" [puppet] - 10https://gerrit.wikimedia.org/r/292154 (owner: 10Alexandros Kosiaris) [15:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:54] !log thcipriani@tin Finished scap: SWAT: [[gerrit:292103|Update for AuthManager]] (duration: 26m 05s) [16:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:00] ^ tgr check please [16:03:27] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345179 (10RobH) [16:03:44] (03CR) 10Alexandros Kosiaris: [C: 032] "Great! thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/292154 (owner: 10Alexandros Kosiaris) [16:03:45] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused [16:03:50] (03PS2) 10Alexandros Kosiaris: gerrit: Enforce a 100MB limit on objects instead [puppet] - 10https://gerrit.wikimedia.org/r/292154 [16:03:52] (03PS1) 10Rush: labstore::fileserver::replicate ensure as a param [puppet] - 10https://gerrit.wikimedia.org/r/292164 [16:03:54] (03CR) 10Alexandros Kosiaris: [V: 032] gerrit: Enforce a 100MB limit on objects instead [puppet] - 10https://gerrit.wikimedia.org/r/292154 (owner: 10Alexandros Kosiaris) [16:04:15] (03PS2) 10Rush: labstore::fileserver::replicate ensure as a param [puppet] - 10https://gerrit.wikimedia.org/r/292164 [16:05:06] (03CR) 10jenkins-bot: [V: 04-1] labstore::fileserver::replicate ensure as a param [puppet] - 10https://gerrit.wikimedia.org/r/292164 (owner: 10Rush) [16:05:41] thcipriani: and that works too. Thanks! 
[16:05:58] (03CR) 10jenkins-bot: [V: 04-1] labstore::fileserver::replicate ensure as a param [puppet] - 10https://gerrit.wikimedia.org/r/292164 (owner: 10Rush) [16:09:07] (03PS1) 10BBlack: switch pybal ProxyFetch healthchecks of caches to varnishcheck [puppet] - 10https://gerrit.wikimedia.org/r/292166 [16:09:07] what's with restbase1010/11 having puppet disabled? [16:09:18] urandom: ^ [16:09:29] paravoid: mid-upgrade [16:09:36] paravoid: which isn't going well [16:09:40] (03PS3) 10Milimetric: Support additional reportupdater directories [puppet] - 10https://gerrit.wikimedia.org/r/289007 (https://phabricator.wikimedia.org/T126549) [16:09:45] ugh :/ [16:09:46] (03PS2) 10Gergő Tisza: Enable AuthManager in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289866 (https://phabricator.wikimedia.org/T135498) [16:09:56] urandom, sorry to hear that, but what does puppet do, auto-start? [16:10:14] 06Operations, 10ops-codfw: Faulty RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2338860 (10faidon) @Papaul, it's good to go at your convenience. [16:10:26] jynus: it would apply the update configuration to two nodes that are not current broken, but would be as soon as it ran [16:10:35] currently [16:10:47] so, it auto-loads the config? 
[16:11:22] no, puppet would write a configuration for the upgraded version of Cassandra, that wouldn't work for the current version [16:11:32] ok [16:12:21] I was looking for alternatives to disabling puppet, but not a good moment now [16:13:22] jouncebot: next [16:13:22] In 2 hour(s) and 46 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160601T1900) [16:13:27] tgr: It's yours :) [16:14:57] (03CR) 10Alexandros Kosiaris: [C: 032] Assign roles::ores::web, roles::ores::worker to SCB [puppet] - 10https://gerrit.wikimedia.org/r/278990 (https://phabricator.wikimedia.org/T124201) (owner: 10Alexandros Kosiaris) [16:15:06] (03PS3) 10Alexandros Kosiaris: Assign roles::ores::web, roles::ores::worker to SCB [puppet] - 10https://gerrit.wikimedia.org/r/278990 (https://phabricator.wikimedia.org/T124201) [16:16:28] (03CR) 10Alexandros Kosiaris: [V: 032] Assign roles::ores::web, roles::ores::worker to SCB [puppet] - 10https://gerrit.wikimedia.org/r/278990 (https://phabricator.wikimedia.org/T124201) (owner: 10Alexandros Kosiaris) [16:17:21] ^I also have the access pending, ready to deploy [16:20:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [16:21:48] 06Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#2345230 (10Mattflaschen-WMF) a:05Mattflaschen-WMF>03None [16:22:33] I am going to check rdb2006 [16:22:33] !log Disabling traces on restbase1008-a.eqiad.wmnet : T126629 [16:22:34] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [16:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:56] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5927036 keys - replication_delay is 0 
[16:23:34] mm, the service was up [16:27:36] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 2 failures [16:27:49] 06Operations, 10ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#2345264 (10RobH) a:05RobH>03Papaul switch port description updated for mw2215-mw2238. (They were already enabled and in the proper vlan, so only description needed update.) Assigni... [16:29:15] (03PS1) 10Eevans: Put 2.2 upgrade on hold for restbase101[0-1].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/292169 (https://phabricator.wikimedia.org/T126629) [16:30:04] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345269 (10Cmjohnson) [16:30:23] jynus, paravoid: ^^^ [16:31:30] is the default value ok? [16:31:42] yes [16:31:56] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 2 failures [16:32:30] I see default "$target_version = '2.1'" [16:32:36] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 2 failures [16:32:38] yeah [16:32:39] so no one has been upgraded to 2.2 [16:32:48] 1007 has [16:32:54] and labs, and staging [16:32:55] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [16:33:15] I see 7, 10 and 11 with 2.2 [16:33:17] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused eevans POS - The acknowledgement expires at: 2016-06-02 16:33:03. 
[16:33:30] ok, seems good [16:33:32] will deploy [16:33:47] (03CR) 10Jcrespo: [C: 032] Put 2.2 upgrade on hold for restbase101[0-1].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/292169 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [16:34:11] thanks, i'll re-enable puppet on those hosts [16:34:15] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 2 failures [16:34:18] wait until I deploy [16:34:25] yup [16:34:59] it should be ok now [16:35:14] is there something I can do for help for the upgrade? [16:35:38] legoktm: do extension.json changes require a scap or is it OK to just sync them? [16:36:01] jynus: the upgraded node is manifesting very high disk read throughput, and I don't know why [16:36:08] tgr: you should only need a scap if there are new/updated messages [16:36:27] that is after repooled? [16:37:19] jynus: it's joined with the other nodes, yes, though i have client connections disabled so it doesn't impact latency [16:37:41] strange, I would expect writes [16:38:13] it is writing, and serving reads normally (though with some added latency) [16:38:41] (03PS3) 10Dzahn: Adjust "monthly Phab community metrics" SQL query to upstream changes [puppet] - 10https://gerrit.wikimedia.org/r/292149 (https://phabricator.wikimedia.org/T136667) (owner: 10Aklapper) [16:38:43] but there is a read rate unexplained by the number of reads (and write throughput is normal) [16:38:50] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2345286 (10Papaul) Yes i will need downtime for that but let me download the firmware first and i will let you or Volans know thanks. 
[16:39:12] !log tgr@tin Synchronized php-1.28.0-wmf.4/extensions/NewUserMessage/: backport [[gerrit:292168]] to update NewUserMessage for AuthManager (duration: 00m 29s) [16:39:15] (03CR) 10Dzahn: [C: 032] "excellent with the link to upstream change and all" [puppet] - 10https://gerrit.wikimedia.org/r/292149 (https://phabricator.wikimedia.org/T136667) (owner: 10Aklapper) [16:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:32] jynus: like, 10x the usual read throughput [16:40:03] is that happening now on 7? [16:40:13] jynus: https://grafana-admin.wikimedia.org/dashboard/db/restbase-cassandra-system?from=now-12h&to=now&panelId=7&fullscreen [16:40:17] tl;dr yes [16:40:27] I see [16:40:54] 06Operations, 10ops-codfw: Faulty RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2345292 (10Papaul) 05Open>03Resolved system is back up with 96GB of RAM. [16:41:06] RECOVERY - IPsec on mc1017 is OK: Strongswan OK - 1 ESP OK [16:41:08] if i could figure out what was being read, that might help [16:41:17] but it hasn't proven easy so far [16:41:26] RECOVERY - IPsec on mc1001 is OK: Strongswan OK - 1 ESP OK [16:42:05] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [16:43:42] the IO is basically all via mmap'ed files, and we haven't found a way to trace down the files they correspond to yet [16:43:57] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:44:23] (03PS1) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [16:44:55] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [16:45:05] the best I can think of would be some combination of blktrace and manual detective work to figure out which files those 
blocks correspond to [16:46:02] well lots of table files open, nothing surprising [16:46:15] the number of minor and major page faults are comparable to the other machines [16:47:05] I see a sequential-like read of tmplink-la-55759-big-Data.db [16:47:18] (03CR) 10Gergő Tisza: [C: 032] Enable AuthManager in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289866 (https://phabricator.wikimedia.org/T135498) (owner: 10Gergő Tisza) [16:47:27] urandom: the rate of major faults reflects the much greater read rate [16:47:49] sorry, I have to check icinga in a second [16:48:02] jynus: Error: Could not find any host matching 'db1092' (config file '/etc/icinga/puppet_hostextinfo.cfg', starting on line 2025) [16:48:09] gwicke: i found the number to be similar to the other machines [16:48:31] elukey, I supposed so- those hosts are being recently installed [16:48:31] jynus: how are you seeing that (a sequential-like read)? [16:48:39] the number, perhaps -- those have been running a lot longer [16:48:49] I was looking at the rate [16:49:02] urandom, it is a pure heuristic, just looking at the offset open from the os [16:49:40] gwicke: how are you getting rate? [16:49:47] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:01] (03PS3) 10Gergő Tisza: Enable AuthManager in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289866 (https://phabricator.wikimedia.org/T135498) [16:50:13] urandom: compare the change of the output of `ps -eo maj_flt,cmd | sort -n | tail -1 | cut -d ' ' -f 1` over a minute or so [16:50:15] (03CR) 10Gergő Tisza: [C: 032] Enable AuthManager in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289866 (https://phabricator.wikimedia.org/T135498) (owner: 10Gergő Tisza) [16:50:23] gwicke: oh ok [16:50:42] gwicke: thought you might have something fancy [16:50:58] maybe a missing regex on hiera? 
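(Editorial sketch, not part of the original conversation.) The `ps` comparison described above can be approximated in a few lines of Python. This is only a rough illustration under stated assumptions: it is Linux-only, it reads the cumulative major-fault counter (`majflt`, field 12 of `/proc/<pid>/stat`) rather than calling `ps`, and the function names `parse_majflt`, `fault_rate`, and `sample_majflt_rate` are hypothetical helpers invented here. Choosing which process to watch is left to the caller; the one-liner in the log picks the process with the highest counter.

```python
import time
from pathlib import Path


def parse_majflt(stat_line: str) -> int:
    """Return the cumulative major-fault count from one /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')' before tokenising the rest.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 12 (majflt) is rest[9].
    return int(rest[9])


def fault_rate(before: int, after: int, seconds: float) -> float:
    """Major faults per second between two samples of the counter."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    return (after - before) / seconds


def sample_majflt_rate(pid: int, seconds: float = 60.0) -> float:
    """Sample the counter twice, `seconds` apart, and return the rate (Linux only)."""
    stat = Path(f"/proc/{pid}/stat")
    before = parse_majflt(stat.read_text())
    start = time.monotonic()
    time.sleep(seconds)
    after = parse_majflt(stat.read_text())
    return fault_rate(before, after, time.monotonic() - start)
```

Running `sample_majflt_rate(pid)` for the same process on two hosts gives per-second rates that can be compared directly, which is the kind of comparison being made in the conversation.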
[16:51:00] (03Merged) 10jenkins-bot: Enable AuthManager in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289866 (https://phabricator.wikimedia.org/T135498) (owner: 10Gergő Tisza) [16:52:05] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:52:16] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:52:46] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:53:41] urandom: over the last 60 seconds, the busiest cassandra process on 1007 produced 15102 faults, while the same on 1008 only saw 2554 [16:54:02] !log tgr@tin Synchronized wmf-config/InitialiseSettings-labs.php: T135504: enable AuthManager in beta (duration: 00m 32s) [16:54:03] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [16:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:21] probably a race condition? I will know it in a second [16:54:51] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [16:54:59] elukey, "Things look okay - No serious problems were detected during the pre-flight check" [16:55:25] but it makes no sense, how can icinga know about the host before puppet is run? [16:55:26] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345378 (10RobH) [16:55:30] (03PS1) 10Alexandros Kosiaris: ores: Hiera fix for ores redis_host [puppet] - 10https://gerrit.wikimedia.org/r/292177 [16:56:29] gwicke: yeah, i just came up with the same. ~260/s vs. ~48/s [16:56:55] (03CR) 10Ema: [C: 031] "Sounds better than the current approach!" 
[puppet] - 10https://gerrit.wikimedia.org/r/292166 (owner: 10BBlack) [16:57:34] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2345391 (10jcrespo) Probably just me (Volans is away). We may want to do it on all ES servers in the end- but there is no hurry for the other hosts. Ping me at almost any time when you see me... [16:58:12] PROBLEM - ores on scb2001 is CRITICAL: Connection refused [16:58:12] PROBLEM - ores on scb2002 is CRITICAL: Connection refused [16:59:01] PROBLEM - ores on scb1002 is CRITICAL: Connection refused [17:00:42] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345398 (10RobH) [17:02:09] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Hiera fix for ores redis_host [puppet] - 10https://gerrit.wikimedia.org/r/292177 (owner: 10Alexandros Kosiaris) [17:02:14] (03PS2) 10Alexandros Kosiaris: ores: Hiera fix for ores redis_host [puppet] - 10https://gerrit.wikimedia.org/r/292177 [17:02:19] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Hiera fix for ores redis_host [puppet] - 10https://gerrit.wikimedia.org/r/292177 (owner: 10Alexandros Kosiaris) [17:02:24] jynus: no idea :( [17:02:36] well, it is fixed now :-) [17:03:32] PROBLEM - Host lvs2006 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:10] !log powered off lvs2006 for disk swap [17:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:18] !log powered on lvs2006. 
disk change did not happen [17:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:33] 06Operations: spare/unused disks on application servers - https://phabricator.wikimedia.org/T106381#2345441 (10faidon) [17:09:35] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2345439 (10faidon) [17:09:37] 06Operations, 10ops-eqiad: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2345440 (10faidon) [17:09:43] 06Operations, 10ops-eqiad: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2239809 (10faidon) Ping? [17:11:37] (03CR) 10Dzahn: [C: 031] Ferm rules for puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/283174 (owner: 10Muehlenhoff) [17:12:24] (03PS2) 10Dzahn: wikistats: stop using package=>latest for php-cli [puppet] - 10https://gerrit.wikimedia.org/r/292091 (owner: 10Muehlenhoff) [17:12:39] (03PS3) 10Dzahn: wikistats: stop using package=>latest for php-cli [puppet] - 10https://gerrit.wikimedia.org/r/292091 (owner: 10Muehlenhoff) [17:13:00] (03CR) 10Dzahn: [C: 032] "labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/292091 (owner: 10Muehlenhoff) [17:13:01] RECOVERY - Host lvs2006 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [17:13:13] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2345451 (10faidon) [17:14:38] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10faidon) @Dzahn, could you perhaps take on some of these as part of your jessie-fication/un-precise-fication effort? [17:15:48] (03PS1) 10Muehlenhoff: Add a pager for commands with extensive output [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/292179 [17:17:07] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2345481 (10Dzahn) yep, magnesium will be soon gone . 
then chromium and hydrogen will be next [17:17:34] 06Operations, 10netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2345483 (10faidon) [17:21:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:44] 06Operations, 10ops-codfw: lvs2006 degraded RAID - https://phabricator.wikimedia.org/T136584#2345517 (10Papaul) The disk that was sent to me is the wrong size , so I am shipping it back. [17:22:49] (03CR) 10Dzahn: "so we are re-using IPs that were formerly old servers? what happened to the ones that are replaced here, i might just lack the context" [dns] - 10https://gerrit.wikimedia.org/r/292072 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [17:22:59] 06Operations, 10ops-codfw, 10DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2345518 (10Papaul) The disk that was sent to me is the wrong size , so I am shipping it back. [17:23:21] 06Operations, 10ops-codfw, 10DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2345519 (10jcrespo) :-( [17:26:23] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add a pager for commands with extensive output [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/292179 (owner: 10Muehlenhoff) [17:26:46] (03PS2) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [17:27:26] (03CR) 10Dzahn: "ok, i see now " Re-used the same mgmt IP's" on ticket, yep" [dns] - 10https://gerrit.wikimedia.org/r/292072 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [17:29:11] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:31] (03CR) 10Kaldari: [C: 032] Test PageAssessments extension on Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292158 (https://phabricator.wikimedia.org/T125551) (owner: 10Kaldari) [17:32:02] (03PS3) 10Krinkle: ULS: Stop using 
/static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [17:33:28] (03CR) 10Krinkle: "When deploying, I suggest applying this to a canary server first and verify everything is working properly and requested from the expected" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [17:33:33] (03CR) 10Krinkle: "https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [17:34:32] PROBLEM - NTP on cp2022 is CRITICAL: NTP CRITICAL: Offset unknown [17:34:41] (03PS3) 10Elukey: Extend the %{format}t timestamp formatter with (begin|end): prefixes [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) [17:35:32] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:35:32] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:35:32] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:35:33] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:36:01] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:36:02] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:36:02] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:36:02] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:36:05] sorry for the icinga spam, cp2017 didn't reboot cleanly ^ [17:36:11] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2017_v4, 
cp2017_v6, cp2019_v4, cp2019_v6 [17:36:23] powercycling it now [17:36:31] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:36:32] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2017_v4, cp2017_v6, cp2019_v4, cp2019_v6 [17:36:41] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:36:42] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:36:51] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:36:52] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:36:52] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2017_v4, cp2017_v6, cp2019_v4, cp2019_v6 [17:36:52] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2017_v4, cp2017_v6, cp2019_v4, cp2019_v6 [17:37:02] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:37:11] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:37:11] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:37:12] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:37:32] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [17:37:42] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [17:37:52] RECOVERY - Host cp2017 is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms [17:38:01] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [17:38:02] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [17:38:02] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [17:38:02] RECOVERY - 
IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [17:38:02] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [17:38:22] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [17:38:31] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [17:38:41] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [17:38:41] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [17:38:42] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [17:38:51] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [17:38:52] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [17:38:52] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [17:39:02] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [17:39:03] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [17:39:03] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [17:39:11] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [17:39:22] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [17:39:31] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [17:39:31] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [17:39:31] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [17:39:32] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [17:39:41] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [17:44:51] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:53] there are 2 unaccepted salt keys on salt-master: druid1003 and mw2060 [17:46:55] 2 on puppet-master, analytics1012 and promethium [17:46:59] 06Operations, 10netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2345574 (10faidon) Spoke with the Level3 representative just now. Estimated delivery of the circuit is July 8th. 
[17:47:05] !log Temporarily disabling puppet to test setting on restbase1007.eqiad.wmnet : T126629 [17:47:06] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [17:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:43] !log depooled reboot of cp4* hosts (T131928) [17:48:43] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [17:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:51:24] !log Restarting Cassandra on restbase1007.eqiad.wmnet : T126629 [17:51:25] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [17:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:51:52] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2006_v4, cp2006_v6, cp2008_v4, cp2008_v6 [17:52:01] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:01] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:02] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2006_v4, cp2006_v6, cp2008_v4, cp2008_v6 [17:52:02] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:13] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:21] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:52:22] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2006_v4, cp2006_v6, cp2008_v4, cp2008_v6 [17:52:31] RECOVERY - NTP on cp2022 is OK: NTP OK: Offset -0.0002706050873 secs [17:52:31] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:52:32] PROBLEM - IPsec on cp4006 is 
CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:32] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:32] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2006_v4, cp2006_v6, cp2008_v4, cp2008_v6 [17:52:32] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:52:32] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:52:42] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:52:42] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:52:42] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:43] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:52:51] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2006_v4, cp2006_v6, cp2008_v4, cp2008_v6 [17:52:51] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 144 not-conn: cp2006_v4, cp2006_v6, cp2008_v4, cp2008_v6 [17:52:51] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:01] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:01] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:53:02] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:53:02] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:11] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:22] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: 
cp2008_v4, cp2008_v6 [17:53:22] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:23] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:53:31] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:32] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:41] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:41] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:52] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2008_v4, cp2008_v6 [17:53:52] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2008_v6 [17:54:12] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.003 second response time on port 9042 [17:54:27] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345595 (10Cmjohnson) [17:55:11] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:19] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2345597 (10Cmjohnson) Ended up replacing both disks and Jeff will re-install. [17:56:06] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2345600 (10Cmjohnson) Silicon has a send ethernet port wired up and does not appear to be in use. 
[17:57:56] (03PS2) 10Dzahn: DNS: Add mgmt dns for the frist 24 new mw app servers mw22[1-3[0-9] Bug:T135466 [dns] - 10https://gerrit.wikimedia.org/r/292072 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [17:58:21] RECOVERY - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is OK: TCP OK - 0.005 second response time on port 9042 [17:58:27] 06Operations, 06Labs: labnet100[12].eqiad.wmnet need to be reimaged with RAID] - https://phabricator.wikimedia.org/T136718#2345604 (10chasemp) [17:58:37] (03CR) 10Dzahn: "yep, we talked about the details on IRC, the new host DRACs already have these IPs, mw2017 is special was not decom'ed" [dns] - 10https://gerrit.wikimedia.org/r/292072 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [17:59:11] RECOVERY - Host cp2006 is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [18:00:18] 06Operations, 10DBA: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2345618 (10jcrespo) OS installed and initial puppetization/salt of: db1079 db1080 db1081 db1083 db1084 db1087 db1088 db1090 db1091 db1092 db1093 Aside from the other hosts, they... 
[18:01:12] RECOVERY - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is OK: TCP OK - 0.006 second response time on port 9042 [18:02:09] (03CR) 10Dzahn: [C: 032] "yep, we talked about the details on IRC, the new host DRACs already have these IPs, mw2017 is special was not decom'ed" [dns] - 10https://gerrit.wikimedia.org/r/292072 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [18:02:12] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [18:02:12] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [18:02:12] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [18:02:12] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [18:02:22] RECOVERY - Host cp2008 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms [18:02:43] PROBLEM - traffic-pool service on cp2008 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is activating [18:02:44] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [18:02:45] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [18:02:53] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [18:02:54] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [18:02:54] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK [18:02:55] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [18:03:04] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [18:03:14] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [18:03:14] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [18:03:23] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [18:03:23] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [18:03:23] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [18:03:24] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK [18:03:33] (03CR) 10Faidon Liambotis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) 
[18:03:33] PROBLEM - confd service on cp4013 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [18:03:33] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [18:03:44] RECOVERY - traffic-pool service on cp2008 is OK: OK - traffic-pool is active [18:03:53] PROBLEM - NTP on cp4013 is CRITICAL: NTP CRITICAL: Offset unknown [18:03:54] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [18:04:15] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [18:04:44] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [18:05:04] RECOVERY - confd service on cp4013 is OK: OK - confd is active [18:05:04] PROBLEM - NTP on cp2008 is CRITICAL: NTP CRITICAL: Offset unknown [18:05:23] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [18:05:34] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [18:05:54] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [18:07:34] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [18:07:34] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [18:07:45] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [18:08:01] (03PS3) 10Rush: labstore::fileserver::replicate ensure as a param [puppet] - 10https://gerrit.wikimedia.org/r/292164 [18:08:13] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [18:08:13] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [18:08:24] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [18:08:44] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [18:09:03] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [18:11:25] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp4005_v4, cp4005_v6 [18:12:45] (03PS4) 10Rush: labstore::fileserver::replicate ensure as a param [puppet] - 10https://gerrit.wikimedia.org/r/292164 [18:13:23] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [18:13:24] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 
56 ESP OK [18:13:24] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [18:13:33] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK [18:14:47] (03CR) 10Rush: [C: 032] labstore::fileserver::replicate ensure as a param [puppet] - 10https://gerrit.wikimedia.org/r/292164 (owner: 10Rush) [18:16:25] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:16:40] !log stopping kafka broker on kafka1018 and rebooting node [18:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:24] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: connect usb external disk to labmon1001 - https://phabricator.wikimedia.org/T136242#2345681 (10RobH) a:05RobH>03Cmjohnson Tried to mount and format this, and it only shows up as 800GB disk in the OS. Can you plug this into your laptop and see i... [18:21:03] RECOVERY - NTP on cp2008 is OK: NTP OK: Offset 0.0001058578491 secs [18:21:34] RECOVERY - NTP on cp4013 is OK: NTP OK: Offset 0.0001415014267 secs [18:21:55] PROBLEM - statsv process on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [18:25:57] !log restarting Cassandra on restbase1007.eqiad.wmnet [18:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:33] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused [18:32:44] PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: Connection refused [18:33:01] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2345822 (10yuvipanda) What this would concretely solve (for me) is projects like tools, where now I have to manually set hiera config on each host of a part... 
[18:33:34] RECOVERY - statsv process on hafnium is OK: PROCS OK: 13 processes with command name python, args statsv [18:33:57] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#2345829 (10yuvipanda) We're also trying to kill all LDAP variables, see T101447 [18:34:53] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL [18:35:11] (03PS1) 10Volans: MariaDB: Add new coredb servers [puppet] - 10https://gerrit.wikimedia.org/r/292197 (https://phabricator.wikimedia.org/T133398) [18:35:43] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200) [18:36:49] (03CR) 10Volans: "@jynus: this is according to the plan in https://phabricator.wikimedia.org/T111992, feel free to discard it and/or include them few at a " [puppet] - 10https://gerrit.wikimedia.org/r/292197 (https://phabricator.wikimedia.org/T133398) (owner: 10Volans) [18:37:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [18:37:05] PROBLEM - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is CRITICAL: Connection refused [18:37:33] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /media/math/check/{type} (Mathoid - check test 
formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [18:38:14] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned [18:38:40] 06Operations, 10DBA, 13Patch-For-Review: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2345865 (10Volans) @jcrespo the above CR is according to the plan I've put in T111992, of course review the plan for possible mistakes and feel free to disca... [18:38:53] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Ge [18:39:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:39:14] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retriev [18:39:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: 
CRITICAL: 77.78% of data above the critical threshold [1000.0] [18:39:33] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [18:39:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Barack Obama page via mobile-sections-lead returned the unexpected status 500 (expecting: 200) [18:40:38] (03PS19) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [18:42:44] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [18:43:34] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:44:01] (03PS1) 10Catrope: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) [18:44:45] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [18:44:48] (03CR) 10Nikerabbit: "We should wait until wmf.4 rolls out because currently incorrect ?version=undefined is still appended to the url." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [18:46:03] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [18:48:54] (03CR) 10Catrope: [C: 04-2] "Not before June 9th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [18:49:41] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2345885 (10brion) As of 11:44 am pacific time I'm seeing 24Mbps on the new route through Chicago, down from 80Mbps earlier this morning. [18:51:43] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.023 second response time on port 9042 [18:51:53] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [18:51:53] RECOVERY - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is OK: TCP OK - 0.001 second response time on port 9042 [18:51:55] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [18:52:23] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [18:52:43] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [18:52:45] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [18:52:54] RECOVERY - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is OK: TCP OK - 0.001 second response time on port 9042 [18:53:13] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:58:24] PROBLEM - Varnishkafka log producer on cp4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [19:00:05] twentyafterfour: Dear anthropoid, the time has come. 
Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160601T1900). [19:00:36] (03Abandoned) 10Addshore: DNM Clone grafana piechart plugin [puppet] - 10https://gerrit.wikimedia.org/r/290212 (https://phabricator.wikimedia.org/T121846) (owner: 10Addshore) [19:03:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:03:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:04:09] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:08:40] (03CR) 10Dzahn: "needed more manual rebase and" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [19:10:37] (03PS1) 10Rush: introduce fileserver::secondary role [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) [19:10:57] (03PS2) 10Rush: introduce fileserver::secondary role [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) [19:12:08] (03PS3) 10Rush: introduce fileserver::secondary role [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) [19:12:19] (03PS4) 10Rush: introduce fileserver::secondary role [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) [19:14:20] (03CR) 10jenkins-bot: [V: 04-1] introduce fileserver::secondary role [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [19:15:09] RECOVERY - Varnishkafka log producer on cp4002 is OK: PROCS OK: 1 process with command name varnishkafka [19:15:51] thcipriani: I think there may be some jenkins weirdness, getting -1's but "1 FATAL: Command "git submodule update" returned status code 1:" [19:16:26] chasemp: which patch? 
[19:16:41] thcipriani: https://gerrit.wikimedia.org/r/#/c/292205/ [19:17:32] hmm "Could not resolve host: gerrit.wikimedia.org" doesn't sound good :\ [19:18:22] thcipriani: still want me to run the train? [19:19:37] 06Operations, 10ops-eqiad: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2346057 (10RobH) Sorry, this is at the office. I'll pick it up and go onsite to ulsfo this Friday. [19:19:38] twentyafterfour: yes if you would roll forward, I would appreciate it. I'm going to leave here in a few minutes. [19:19:53] 06Operations, 10ops-ulsfo: ship single dell 500GB sata to ulsfo - https://phabricator.wikimedia.org/T133699#2346058 (10RobH) [19:19:55] I can resolve it within labs otherwise [19:20:09] twentyafterfour: this bug is *sort of* a blocker, but it's seemed twitchy: https://phabricator.wikimedia.org/T136705 [19:20:29] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:22:29] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:22:47] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: connect usb external disk to labmon1001 - https://phabricator.wikimedia.org/T136242#2346061 (10Cmjohnson) Checked the 3TB on my macbook and it came up w/ 800GB as well. Swapped to a 2TB and shows normal. Plugged into labmon1001 and it appears as we... 
[19:22:53] (03CR) 10Rush: [C: 032 V: 032] "jenkins seems unhappy but I don't think it's related here" [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [19:23:10] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: connect usb external disk to labmon1001 - https://phabricator.wikimedia.org/T136242#2346062 (10RobH) 05Open>03Resolved [19:25:53] (03PS1) 10Jhobs: Enable Hovercards for huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292206 [19:28:10] chasemp: hmm, I'm a little worried that this failure maybe had something to do with the fact that it cloned from scandium but I can't find a difference between repos on either gallium or scandium and I'm tempted to call it a fluke inside the nodepool box. [19:28:53] (03CR) 10Rush: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [19:29:06] thcipriani: I thikn it's blocking my merge? [19:29:15] but yeah I don't get it either atm [19:29:44] brb tho [19:29:54] yeah, I actually have to run, too [19:30:11] hopefully the "recheck" will get it merged. [19:31:46] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100% [19:32:04] (03PS2) 10BBlack: switch pybal ProxyFetch healthchecks of caches to varnishcheck [puppet] - 10https://gerrit.wikimedia.org/r/292166 [19:32:34] (03CR) 10BBlack: [C: 032 V: 032] switch pybal ProxyFetch healthchecks of caches to varnishcheck [puppet] - 10https://gerrit.wikimedia.org/r/292166 (owner: 10BBlack) [19:39:39] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [19:40:10] so uh [19:40:34] !log restarting pybals for healthcheck config changes [19:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:42] "Wikimedia\Assert\ParameterTypeException" [19:40:53] anyone know wth that's about? [19:41:16] e.g. 
https://test.wikidata.org/wiki/Special:ListUsers [19:41:29] twentyafterfour: legoktm started debugging this yesterday [19:42:33] it's blocking the train and I'm not sure what code to revert in order to unblock it :-/ [19:43:04] !log cp* hosts rebooted (T131928) [19:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:16] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [19:43:26] twentyafterfour: yeah, no one has any idea what's causing this, afaik [19:43:54] we unfortunately have no team tasked with debugging all the random issues that come up (except Operations i guess ;) ) [19:44:25] (03CR) 10Bmansurov: [C: 04-1] "Take a look at the mw.experiment module's getBucket function documentation. You need to have something similar to the first argument's str" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292206 (owner: 10Jhobs) [19:44:36] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: investigate RAID failure on beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T135178#2346110 (10Jgreen) 05Open>03Resolved rebuilt, kerberos data restored [19:45:14] MatmaRex: right, I don't really see hardly any references to that exception in the codebase. [19:45:23] I guess the train is held-up for now [19:45:34] * twentyafterfour pokes it with a stick [19:45:44] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2346116 (10Jgreen) Perfect! [19:46:07] (03CR) 10Bmansurov: "Also, can you link to the bug: https://phabricator.wikimedia.org/T134778 ?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/292206 (owner: 10Jhobs) [19:47:31] (03PS1) 10Dzahn: add endowment.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/292208 (https://phabricator.wikimedia.org/T136735) [19:48:29] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2346120 (10Dzahn) [19:50:49] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2346121 (10Dzahn) [19:51:44] twentyafterfour: legoktm: hmm [19:51:52] twentyafterfour: this works: https://test.wikidata.org/wiki/Special:ListUsers?limit=7 [19:52:00] twentyafterfour: this doesn't: https://test.wikidata.org/wiki/Special:ListUsers?limit=8 [19:52:20] i think it just doesn't like the 8th user there. (or the 9th maybe, if it queries ones past) [19:52:28] (03Abandoned) 10Gergő Tisza: [HOLD] Enable AuthManager on beta wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291782 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [19:52:37] PROBLEM - NTP on cp4017 is CRITICAL: NTP CRITICAL: Offset unknown [19:53:16] twentyafterfour: legoktm: doing the same query on stat1003, the 9th user's name is '2147483647', which i'll note is just a number. and the traceback says that a number is being passed there. i think something somewhere is trying to be helpful and casts number-like things to numbers [19:53:23] (03PS1) 10Dzahn: cache::misc add endowment.wm.org -> bromine [puppet] - 10https://gerrit.wikimedia.org/r/292210 (https://phabricator.wikimedia.org/T136735) [19:53:27] and TitleValue does strict type checking [19:54:38] twentyafterfour: so if you want to unblock the train, i say just cast $dbkey to string in TitleValue::__construct. 
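A minimal sketch of the failure mode discussed above, written in Python rather than MediaWiki's PHP. Everything in it is hypothetical: `StrictTitle` only stands in for a strictly type-checked value object such as TitleValue, and `coerce_key` only imitates some layer that "helpfully" turns number-like strings into numbers; neither is MediaWiki code. It illustrates why a user named '2147483647' could trip a strict type check, and why casting the key back to a string is a plausible workaround.

```python
class StrictTitle:
    """Hypothetical stand-in for a value object that type-checks its key."""

    def __init__(self, namespace, dbkey):
        # Strict check: anything that is not a string is rejected outright.
        if not isinstance(dbkey, str):
            raise TypeError(f"dbkey must be a string, got {type(dbkey).__name__}")
        self.namespace = namespace
        self.dbkey = dbkey


def coerce_key(name):
    """Imitate a layer that silently turns number-like strings into ints (assumption)."""
    return int(name) if name.isdigit() else name


def build_titles(names, cast_back=False):
    """Build one StrictTitle per name, optionally casting keys back to str."""
    titles = []
    for name in names:
        key = coerce_key(name)
        if cast_back:
            key = str(key)  # the "just cast to string" workaround
        titles.append(StrictTitle(2, key))
    return titles


users = ["Alice", "Bob", "2147483647"]

# Without the cast, the all-digit user name arrives as an int and is rejected.
try:
    build_titles(users)
    failed = False
except TypeError:
    failed = True

# With the cast, every key is a string again and construction succeeds.
fixed = build_titles(users, cast_back=True)
```

In this sketch the list only fails once the all-digit name is included, which mirrors the observation that a smaller `limit` worked while a larger one did not.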
[19:57:05] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2346131 (10Dzahn) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160601T2000). [20:00:18] 06Operations: request Gerrit project/repo wikimedia/endowment - https://phabricator.wikimedia.org/T136736#2346139 (10Dzahn) [20:00:27] ok jouncebot .. will get on it. [20:00:54] 06Operations, 10Traffic: Fix lvs1001-6 storage - https://phabricator.wikimedia.org/T136737#2346153 (10BBlack) [20:01:39] 06Operations: request Gerrit project/repo wikimedia/endowment - https://phabricator.wikimedia.org/T136736#2346139 (10Dzahn) let's clarify permissions and who has +2 though [20:01:39] (03CR) 10BBlack: [C: 031] cache::misc add endowment.wm.org -> bromine [puppet] - 10https://gerrit.wikimedia.org/r/292210 (https://phabricator.wikimedia.org/T136735) (owner: 10Dzahn) [20:03:22] legoktm: twentyafterfour: yeahhh, i think i got this [20:04:27] (03PS2) 10Dzahn: cache::misc add endowment.wm.org -> bromine [puppet] - 10https://gerrit.wikimedia.org/r/292210 (https://phabricator.wikimedia.org/T136735) [20:04:35] (03CR) 10Dzahn: [C: 032] cache::misc add endowment.wm.org -> bromine [puppet] - 10https://gerrit.wikimedia.org/r/292210 (https://phabricator.wikimedia.org/T136735) (owner: 10Dzahn) [20:06:20] twentyafterfour: legoktm: this is caused by 682116760198a7420a809e0b9966ecdc63f1c67d. i'll submit a patch in a few minutes [20:06:54] (you could also try reverting that, but it's an easy fix) [20:06:56] MatmaRex: indeed it is 682116760198a7420a809e0b9966ecdc63f1c67d [20:07:00] !log starting parsoid deploy [20:07:02] (I raised concern on the commit) [20:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:21] twentyafterfour: in phab? no one looks at those. 
:) comment on gerrit [20:07:34] anyway. patch coming, let me just merge all the tasks about this [20:08:19] MatmaRex: people should be looking at unbreak tasks they are cc'd to. And I don't think you can generalize that everyone checks gerrit and not phab [20:08:30] especially on an already merged patch [20:08:54] 06Operations: request Gerrit project/repo wikimedia/endowment - https://phabricator.wikimedia.org/T136736#2346174 (10Paladox) Add @QChris and @demon since they deal with creating repos. [20:10:02] (03PS5) 10Rush: introduce fileserver::secondary role [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) [20:10:51] !log synced new code; restarted parsoid on wtp1001 as a canary [20:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:13:33] twentyafterfour: i'm just saying, if you want people to see your comments, put them on gerrit… if you don't want them to see, that's fine [20:13:42] twentyafterfour: anyway, the patch is https://gerrit.wikimedia.org/r/292215 [20:14:00] (03CR) 10Rush: [V: 032] introduce fileserver::secondary role [puppet] - 10https://gerrit.wikimedia.org/r/292205 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [20:14:40] twentyafterfour: please backport and verify, this is just a drive-by patch and i won't be working to deploy it. busy with other stuff right now. :) [20:14:56] MatmaRex: thanks I'm deploying so I can handle it [20:15:04] much appreciated. [20:17:22] !log Rolling restart of RESTBase (redistribute Cassandra client connections?) 
: T126629 [20:17:23] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [20:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:33] !log finished deploying parsoid sha afb0d522 [20:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:19:06] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] 06Operations, 10Traffic, 06Community-Liaisons (Apr-Jun-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2346235 (10BBlack) [20:20:57] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.715 second response time [20:29:48] (03PS20) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [20:32:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [20:39:08] RECOVERY - NTP on cp4017 is OK: NTP OK: Offset 0.000150680542 secs [20:39:52] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.4/includes/cache/LinkBatch.php: deploy https://gerrit.wikimedia.org/r/#/c/292217/ (duration: 00m 27s) [20:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:01] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/3021/" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [20:42:28] watches that on carbon .. renaming roles and stuff [20:42:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[20:42:56] splits the aptrepo things from installserver [20:43:03] (03PS1) 1020after4: group1 wikis to 1.28.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292219 [20:43:27] (03CR) 1020after4: [C: 032] group1 wikis to 1.28.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292219 (owner: 1020after4) [20:44:03] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292219 (owner: 1020after4) [20:44:20] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.4 [20:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:07] (03PS1) 10RobH: update dns entries for relforge100[12] [dns] - 10https://gerrit.wikimedia.org/r/292222 [20:48:32] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2346387 (10russblau) Sorry; it appears that I must have stopped reading before the end of the sentence. So, if importing revision will take roughly a month, that means that pagelinks will take another //thre... 
[20:48:47] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 2 failures [20:49:43] (03CR) 10RobH: [C: 032] update dns entries for relforge100[12] [dns] - 10https://gerrit.wikimedia.org/r/292222 (owner: 10RobH) [20:50:43] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2346420 (10RobH) [20:52:26] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: puppet fail [20:55:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:33] !log starting mobileapps deploy [20:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:56:28] fixing boron ^^^ [20:56:37] !log restarting postgresql on maps2001 [20:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:15] (03CR) 10Milimetric: [C: 031] "ottomata, you can merge this at any point, I merged the repos it depended on" [puppet] - 10https://gerrit.wikimedia.org/r/289007 (https://phabricator.wikimedia.org/T126549) (owner: 10Milimetric) [20:58:43] !log mobileapps deployed ed0e2e4 [20:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:16] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 178 seconds ago with 0 failures [21:02:44] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [21:03:57] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/P3201" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [21:06:04] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[21:07:51] (03PS1) 10Dzahn: install/aptrepo: move distributions file [puppet] - 10https://gerrit.wikimedia.org/r/292238 (https://phabricator.wikimedia.org/T132757) [21:09:17] (03PS2) 10Dzahn: install/aptrepo: move distributions file [puppet] - 10https://gerrit.wikimedia.org/r/292238 (https://phabricator.wikimedia.org/T132757) [21:09:27] (03CR) 10Dzahn: [C: 032] install/aptrepo: move distributions file [puppet] - 10https://gerrit.wikimedia.org/r/292238 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [21:14:54] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:21:03] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [21:21:25] (03PS1) 10Alexandros Kosiaris: ores: Allow specifying redis_password [puppet] - 10https://gerrit.wikimedia.org/r/292268 [21:21:57] (03PS1) 10Dzahn: aptrepo: un-change conf/incoming [puppet] - 10https://gerrit.wikimedia.org/r/292269 (https://phabricator.wikimedia.org/T132757) [21:22:34] (03PS2) 10Dzahn: aptrepo: un-change conf/incoming [puppet] - 10https://gerrit.wikimedia.org/r/292269 (https://phabricator.wikimedia.org/T132757) [21:22:55] (03CR) 10Dzahn: [C: 032] aptrepo: un-change conf/incoming [puppet] - 10https://gerrit.wikimedia.org/r/292269 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [21:24:14] (03CR) 10Dzahn: [V: 032] ".erb change only" [puppet] - 10https://gerrit.wikimedia.org/r/292269 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [21:26:00] (03CR) 10Dzahn: "root@carbon:/# diff /root/backup/srv/wikimedia/conf/incoming /srv/wikimedia/conf/incoming" [puppet] - 10https://gerrit.wikimedia.org/r/292269 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [21:26:07] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.3/includes/specials/SpecialPrefixindex.php: sync https://gerrit.wikimedia.org/r/#/c/292234/ ( T136738 ) (duration: 00m 30s) [21:26:08] T136738: Pagination broken on 
Special:PrefixIndex - https://phabricator.wikimedia.org/T136738 [21:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:31] (03CR) 10Dzahn: "good now with follow-ups https://gerrit.wikimedia.org/r/#/c/292238/ and https://gerrit.wikimedia.org/r/#/c/292269/" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [21:29:03] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2346688 (10Dzahn) [21:31:27] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.4/includes/specials/SpecialPrefixindex.php: sync https://gerrit.wikimedia.org/r/#/c/292228/ ( T136738 ) (duration: 00m 26s) [21:31:28] T136738: Pagination broken on Special:PrefixIndex - https://phabricator.wikimedia.org/T136738 [21:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:35] hi, I'm sorry for leaving early yesterday before deployment of a patch I suggested. Would it be possible to include it in the current SWAT? (https://gerrit.wikimedia.org/r/#/c/284087/) [21:33:38] eranroz: I can deploy now if you'd like [21:33:42] (03CR) 1020after4: [C: 032] Show counts in category pages on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz) [21:34:10] twentyafterfour: thanks. 
[21:34:15] (03PS3) 1020after4: Show counts in category pages on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz) [21:35:08] (03CR) 1020after4: [C: 032] Show counts in category pages on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz) [21:35:43] (03Merged) 10jenkins-bot: Show counts in category pages on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz) [21:37:17] !log twentyafterfour@tin Synchronized wmf-config/InitialiseSettings.php: deploy /wmf-config/InitialiseSettings.php for eranroz ( T132972 ) (duration: 00m 25s) [21:37:18] T132972: Categorytree counts doesn't appear in hewiki - https://phabricator.wikimedia.org/T132972 [21:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:37:29] eranroz: how's that? [21:37:58] twentyafterfour: thanks! [21:38:02] works [21:38:13] eranroz: you're welcome [21:38:19] * twentyafterfour is finished deploying [21:38:31] !log train has left the station [21:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:44:19] (03PS1) 10Jforrester: Enable VisualEditor by default for logged-in users on four Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292274 [21:44:21] (03PS1) 10Jforrester: Enable VisualEditor by default for logged-out users on four Wikipedias too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292275 [21:46:21] (03PS1) 10Dzahn: install1001, add new aptrepo role like carbon [puppet] - 10https://gerrit.wikimedia.org/r/292276 (https://phabricator.wikimedia.org/T132757) [21:46:38] (03CR) 10Jforrester: "Scheduled for 13 June." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/292275 (owner: 10Jforrester) [21:46:54] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:47:07] (03CR) 10Jforrester: [C: 04-1] "Scheduled for 6 June." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292274 (owner: 10Jforrester) [21:47:31] (03CR) 10Jforrester: [C: 04-1] Enable VisualEditor by default for logged-out users on four Wikipedias too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292275 (owner: 10Jforrester) [21:47:43] twentyafterfour: does CentralNotice have some special deployment schedule? it looks like https://phabricator.wikimedia.org/T136387 just regressed [21:48:07] awight: ^ [21:48:12] MatmaRex: not that I know of [21:48:16] (03CR) 10Dzahn: [C: 032] install1001, add new aptrepo role like carbon [puppet] - 10https://gerrit.wikimedia.org/r/292276 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [21:48:20] (03PS1) 10Alexandros Kosiaris: Fix ores redis password lookup [labs/private] - 10https://gerrit.wikimedia.org/r/292277 [21:48:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix ores redis password lookup [labs/private] - 10https://gerrit.wikimedia.org/r/292277 (owner: 10Alexandros Kosiaris) [21:50:03] MatmaRex: I didn't sync the whole tree so that shouldn't have been touched [21:50:06] I just did sync-file [21:51:50] twentyafterfour: the extensions doesn't seem to have the usual branches, it was a "wmf_deploy" branch though… [21:52:15] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.013 second response time [21:52:18] twentyafterfour: wanna deploy https://gerrit.wikimedia.org/r/#/c/292279/ ? 
[21:54:07] (it's a cherry-pick of the patch that worked for wmf.3) [22:01:23] ok [22:02:45] (03PS2) 10Alexandros Kosiaris: ores: Allow specifying redis_password [puppet] - 10https://gerrit.wikimedia.org/r/292268 [22:03:09] (03CR) 10jenkins-bot: [V: 04-1] ores: Allow specifying redis_password [puppet] - 10https://gerrit.wikimedia.org/r/292268 (owner: 10Alexandros Kosiaris) [22:08:33] PROBLEM - Graphite Carbon on labmon1001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [22:08:35] MatmaRex: Thanks for the ping! [22:09:19] MatmaRex: I think it regressed because the change was not reverted from our deploy branch. We'll get a fix ready for SWAT in an hour. [22:10:37] awight: thanks. i think twentyafterfour is re-deploying that revert to wmf.4 in the meantime [22:10:51] MatmaRex: okay that will be fine for us, thanks! [22:12:51] (03CR) 10Jhobs: "The documentation around enabling an experiment does not seem to be very helpful. I'm entirely confused about what the name of the propert" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292206 (owner: 10Jhobs) [22:12:56] (03PS3) 10Alexandros Kosiaris: ores: Allow specifying redis_password [puppet] - 10https://gerrit.wikimedia.org/r/292268 [22:13:27] (03CR) 10jenkins-bot: [V: 04-1] ores: Allow specifying redis_password [puppet] - 10https://gerrit.wikimedia.org/r/292268 (owner: 10Alexandros Kosiaris) [22:15:42] (03PS2) 10Jhobs: Enable Hovercards for huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292206 (https://phabricator.wikimedia.org/T134778) [22:16:34] (03CR) 10jenkins-bot: [V: 04-1] Enable Hovercards for huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292206 (https://phabricator.wikimedia.org/T134778) (owner: 10Jhobs) [22:17:40] MatmaRex: deploying [22:18:22] (03PS3) 10Jhobs: Enable Hovercards for huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292206 (https://phabricator.wikimedia.org/T134778) [22:22:52] (03PS1) 10Yuvipanda: install_server: Change 
partman config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/292287 (https://phabricator.wikimedia.org/T127957) [22:23:11] robh: ^ is that the change? [22:23:33] or do I need to make a new config? [22:23:48] (03CR) 10RobH: [C: 031] install_server: Change partman config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/292287 (https://phabricator.wikimedia.org/T127957) (owner: 10Yuvipanda) [22:23:49] yep [22:23:55] well, that uses /srv [22:24:02] which is fine, I think [22:24:06] i may be mis-recalling, but didnt you use smoething else? [22:24:13] but yeah, srv is the wmf preferred. [22:24:25] robh: the current setup was hand partitioned, I guess [22:24:30] robh: there's a graphite specific one [22:24:52] but that's not under /srv [22:25:00] yeah, that was it [22:25:02] and mounts stuff in /var/lib/carbon [22:25:07] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2347396 (10ori) Breakdown of 10,718,138 PURGEs, captured on 2016-06-01 between 18:00 and 22:00 UTC: ## To... [22:25:13] thats going to put that into a small section of your / root [22:25:24] so if its a lot of data, it may be problematic [22:25:45] robh: so labmon1001 was already configured to store graphite data in /srv/carbon [22:25:52] awesome [22:26:02] so i was just recalling another graphite system where that was it [22:26:06] robh: right. 
[22:26:08] so many servers oO [22:26:16] robh: so I'm wondering if I should make this be consistent with the graphite machines [22:26:19] and user graphite.cfg [22:26:24] or with most ofther machines [22:26:26] and use /srv [22:27:52] i think the graphite machines should conform to the rest of cluster [22:27:55] and use /srv [22:28:02] alright then [22:28:26] robh: I can merge this patch now, and that won't affect anything right until I try to re-install [22:28:46] robh: also it deals with the multiple disks properly, etc? (I can't read / understand partman configs yet :|) [22:30:39] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.4/extensions/CentralNotice/: deploy https://gerrit.wikimedia.org/r/#/c/292279/ (duration: 00m 26s) [22:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:32:13] YuviPanda: as long as its 4 disks (it is) and we want a smaller / and a large /srv in an lvm in ext4 [22:32:14] then yep [22:33:44] (03PS2) 10Yuvipanda: install_server: Change partman config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/292287 (https://phabricator.wikimedia.org/T127957) [22:33:59] (03CR) 10Yuvipanda: [C: 032 V: 032] install_server: Change partman config for labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/292287 (https://phabricator.wikimedia.org/T127957) (owner: 10Yuvipanda) [22:37:23] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [22:38:38] robh: hmm, how do I specify jessie/trusty? I can't find that in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Installation [22:38:43] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2346094 (10Krenair) Do we really want to allow more microsites? [22:39:04] YuviPanda: if trusty, its default do nothing [22:39:22] if jessie, check out the install_server/files/dhcpd/linux.hosts.something.1152.... 
[22:39:24] robh: I figured I might as well upgrade to jessie to match other graphite hosts [22:39:30] and the entries have jessie lines per host [22:39:39] grep that file for jessie and it will be very easy to see [22:39:57] ah [22:39:59] ok [22:40:04] linux-host-entries.ttyS1-115200 [22:41:24] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:32] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:40] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:47] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:59] YuviPanda: is that labmon1001? [22:42:06] YuviPanda: ? [22:42:08] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:37] I think tools checker is having issues because nfs is up and ok [22:42:37] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [22:43:09] (03PS1) 10Yuvipanda: install_server: Switch labmon1001 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/292290 (https://phabricator.wikimedia.org/T127957) [22:43:11] hmm [22:43:13] no [22:43:28] none of those need to use graphite [22:43:38] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.437 second response time [22:43:39] I also took down graphite a while ago [22:43:45] it's not a failure it's a tiemout, i.e. some kind of reachability or load issue w/ teh checker host? 
[22:43:45] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.266 second response time [22:43:51] looking now [22:43:53] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.540 second response time [22:44:12] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.113 second response time [22:44:24] I can verify that toolschecker doesn't touch labmon at all [22:44:35] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [22:44:35] both checker hosts seem fine [22:44:37] to me atm [22:44:39] (03PS4) 10Madhuvishy: uwsgi: Allow specifying plugins optionally as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 [22:45:04] yeah [22:45:04] but unrelated checks failed at the same time and yet the services they are meant to test are up [22:45:13] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures [22:45:22] somethign is weird w/ the check logic or race condition-y [22:45:27] not sure [22:45:40] chasemp: I think this might've been the parallelism change from a few weeks ago [22:45:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [22:46:01] YuviPanda: kind of what I was thinking but I have no proof of that in any way [22:46:02] chasemp: so if one hit takes more than 60s, it'll halt the others [22:46:05] me neither [22:46:29] gave me a nice little jolt to the heart tho w/ the pages [22:46:34] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [22:46:47] YuviPanda: why did we make that parallelism change agin? 
[22:46:53] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:47:03] chasemp: because we aren't re-entrant [22:47:05] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [22:47:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:47:32] chasemp: if you look at the webservice check [22:47:34] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:47:49] chasemp: if it is hit simultaneously twice, it'll fail [22:48:15] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.025 second response time [22:48:19] right [22:48:27] chasemp: but I see now that'll cause subsequent checks to fail too [22:48:29] we'll have to figure something out here I imagine but seems like false alarm [22:48:32] yeah [22:48:49] I think multiple things - we should increase timeouts on these on the client side (in icinga) [22:48:54] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:48:59] hmm [22:48:59] (03PS5) 10Madhuvishy: uwsgi: Allow specifying plugins optionally as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 [22:49:01] it is flapping [22:49:36] hmm [22:49:40] why can't i find *any* logs?! [22:50:00] ah I do [22:50:17] I see some [22:50:19] SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /service/start (ip 208.80.154.14) !!! 
[22:50:21] uwsgi_response_write_headers_do(): Broken pipe [core/writer.c line 216] [22:50:53] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 4.004 second response time [22:51:09] chasemp: mind if I restart uwsgi there? [22:51:13] please [22:51:43] 208.80.154.14 is the address neon gets externally YuviPanda fyi [22:51:46] done [22:51:47] figured but also checked [22:51:48] chasemp: yeah [22:51:59] chasemp: so I think we should increase client timeouts to be a multiple of the server timeout [22:52:03] the server timeout is like 60s [22:52:10] it needs to be a multiple because of our serialization [22:52:20] *or* we could make the server code be re-entrant [22:52:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:52:50] it's going to get really unpopular really quick as-is :) [22:52:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:53:02] yeah :D [22:53:20] not sure what to do atm but dinner is burning so I gotta brb, can you bump looking at this up on your list YuviPanda? [22:53:25] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:53:26] chasemp: yep, doing now [22:55:08] (03PS1) 10Yuvipanda: icinga: Make the tools checks not page temporarily [puppet] - 10https://gerrit.wikimedia.org/r/292293 (https://phabricator.wikimedia.org/T136775) [22:55:25] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#527721 (10AlexMonk-WMF) [22:55:26] robh: does https://gerrit.wikimedia.org/r/#/c/292290/ look ok to you? 
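(editor's note on the 22:51:59 to 22:52:20 timeout discussion) A minimal sketch of the arithmetic behind "client timeouts should be a multiple of the server timeout", assuming the checker handles requests strictly one at a time. The 60s server timeout is the approximate figure quoted in the discussion and the 10s client timeout is taken from the "Socket timeout after 10 seconds" alerts; the function names and the queue depth of 3 are hypothetical, chosen only for illustration, and are not part of toolschecker or icinga.

```python
def min_client_timeout(server_timeout_s: float, serialized_checks: int) -> float:
    """Worst-case wait for the last check when checks run one at a time.

    Assumption: each queued check may take up to the full server timeout,
    and a check cannot start until the one ahead of it has finished.
    """
    if serialized_checks < 1:
        raise ValueError("serialized_checks must be at least 1")
    return server_timeout_s * serialized_checks


def positions_timing_out(client_timeout_s: float,
                         server_timeout_s: float,
                         serialized_checks: int) -> list[int]:
    """Return the 1-based queue positions whose worst-case wait exceeds
    the client timeout, i.e. the checks that could be reported as failed
    even though the service they test is healthy."""
    return [
        position
        for position in range(1, serialized_checks + 1)
        if position * server_timeout_s > client_timeout_s
    ]


if __name__ == "__main__":
    # Hypothetical queue of 3 simultaneous checks, 60s server timeout.
    print(min_client_timeout(60, 3))
    # With a 10s client timeout, every queue position can time out in the worst case.
    print(positions_timing_out(10, 60, 3))
    # With the client timeout raised to 3 x 60s, none can.
    print(positions_timing_out(180, 60, 3))
```

Under these assumptions, raising the client timeout only hides the queueing delay; making the server code re-entrant, the other option raised in the discussion, would remove the need for the multiplier altogether.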
[22:55:44] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#527731 (10AlexMonk-WMF) [22:55:51] (03PS2) 10Yuvipanda: icinga: Make the tools checks not page temporarily [puppet] - 10https://gerrit.wikimedia.org/r/292293 (https://phabricator.wikimedia.org/T136775) [22:56:57] (03CR) 10Yuvipanda: [C: 032 V: 032] icinga: Make the tools checks not page temporarily [puppet] - 10https://gerrit.wikimedia.org/r/292293 (https://phabricator.wikimedia.org/T136775) (owner: 10Yuvipanda) [22:57:43] chasemp: as a fyi I took them out of paging to begin with to give us some breathing room [22:59:17] (03CR) 10RobH: [C: 031] install_server: Switch labmon1001 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/292290 (https://phabricator.wikimedia.org/T127957) (owner: 10Yuvipanda) [22:59:19] YuviPanda: yep [22:59:37] (03PS2) 10Yuvipanda: install_server: Switch labmon1001 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/292290 (https://phabricator.wikimedia.org/T127957) [22:59:44] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2347542 (10ori) @EBernhardson made it so when a job fragments into a number of child jobs, each child job h... [22:59:45] (03CR) 10Yuvipanda: [C: 032 V: 032] install_server: Switch labmon1001 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/292290 (https://phabricator.wikimedia.org/T127957) (owner: 10Yuvipanda) [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160601T2300). [23:00:04] awight James_F kaldari RoanKattouw Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [23:00:14] * RoanKattouw is here [23:00:26] Hello. [23:00:49] 06Operations, 10Traffic, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2347550 (10AlexMonk-WMF) Yeah, those are definitely the right questions but I don't have any useful answers, and am not likely to have time... [23:02:24] * James_F waves too. [23:03:02] RoanKattouw: full scap required for https://gerrit.wikimedia.org/r/#/c/292270/ I guess? [23:03:31] looking into mobileapps [23:03:39] Dereckson: Urgh, yes :( sorry, I hadn't noticed that [23:06:22] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:25] Krenair: there? [23:06:47] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:55] !log Update restbase staging to f05b66f [23:06:56] here [23:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:09] (03PS2) 10Dereckson: Follow-up abd14947: Use HTTPS URL to citoid instead of protocol-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292068 (https://phabricator.wikimedia.org/T136423) (owner: 10Alex Monk) [23:07:16] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292068 (https://phabricator.wikimedia.org/T136423) (owner: 10Alex Monk) [23:08:05] (03Merged) 10jenkins-bot: Follow-up abd14947: Use HTTPS URL to citoid instead of protocol-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292068 (https://phabricator.wikimedia.org/T136423) (owner: 10Alex Monk) [23:08:13] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.043 second response time [23:08:37] RECOVERY - toolschecker service itself needs to return 
OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.027 second response time [23:10:14] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Use HTTPS URL to citoid instead of protocol-relative (T136423) (duration: 00m 32s) [23:10:15] T136423: [Regression] Cannot add automatic citation from Beta Cluster because of HTTP -> HTTPS redirect - https://phabricator.wikimedia.org/T136423 [23:10:17] Krenair: please test ^ [23:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:45] awight: would you be around? [23:11:03] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:11:07] Dereckson: yep, I'm around [23:11:13] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:14] RL caching [23:11:15] sigh [23:11:16] To test the CentralNotice config? [23:11:54] (03PS6) 10Dereckson: Use full URL in $wgNoticeHideUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:12:01] gotcha, I'm ready [23:12:08] AndyRussG: ^ [23:12:13] (03CR) 10jenkins-bot: [V: 04-1] Use full URL in $wgNoticeHideUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:13:10] (Zuul merged the wmf4 changes) [23:13:12] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.241 second response time [23:13:18] PROBLEM - MariaDB Slave Lag: s1 on db1053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.64 seconds [23:13:57] * AndyRussG watches [23:14:12] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:13] awight: I'm manually rebasing it [23:14:24] Dereckson, 
okay, prod seems fine [23:15:27] Krenair: good :) [23:15:42] awight: ah yes, array() → [] [23:17:11] urrgh [23:17:19] PROBLEM - MariaDB Slave Lag: s1 on db1053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.97 seconds [23:17:28] beta seems fixed [23:17:29] thanks [23:17:36] !log Deploying d8fa5c0 to RESTBase production [23:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:03] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.044 second response time [23:18:09] hrmm, so if those slave lags dont hit more than 5 minutes behind are we ok? [23:18:26] i know they were re-enabled due to the recent spikes in lag (that caused issues) [23:18:29] (03PS7) 10Dereckson: Use full URL in $wgNoticeHideUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:19:04] (03CR) 10Dereckson: "PS7: rebased against master, short array syntax" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:19:14] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:19:14] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:19:24] (03CR) 10Awight: [C: 031] "Looks correct." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:19:53] (03CR) 10Dereckson: "SWAT, take two" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:20:02] (03Merged) 10jenkins-bot: Use full URL in $wgNoticeHideUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [23:20:32] robh: Hmm, it's a dumps host, so how much lag is acceptable would be an aper.gos / DBA question [23:21:11] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Use full URL in $wgNoticeHideUrls (T130442) (duration: 00m 23s) [23:21:11] robh: For non-dumps DBs, my understanding is that lag over 5s is concerning and lag over 30s is a problem [23:21:12] T130442: Use standard URL index.php?title= ... for background requests - https://phabricator.wikimedia.org/T130442 [23:21:15] awight: please test ^ [23:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:42] Dereckson: testing... [23:21:45] twentyafterfour: was the CentralNotice change deployed earlier? i still get the old code at https://commons.wikimedia.org/wiki/Special:UploadWizard [23:22:07] (the symptom is, UW doesn't load in IE 11) [23:22:17] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2347595 (10brion) Currently seeing my baseline 80 Mbps; floating IP 208.80.155.243 has been assigned for now to test without the proxy, just to double-con... [23:22:49] RoanKattouw: that is far less concerning then. it seems bad if dumps were lagged, but not bad enough to wake one of them up. 
[23:22:57] Yeah exactly [23:22:58] (03PS1) 10Yuvipanda: icinga: Increase timeout for tools-checker checks [puppet] - 10https://gerrit.wikimedia.org/r/292297 (https://phabricator.wikimedia.org/T136775) [23:23:06] thanks for the info =] [23:23:12] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.073 second response time [23:23:20] If a production DB were lagged by something measured in minutes, that would be a stop the presses, wake people up kind of thing [23:23:37] So I was very confused at first :) but for dumps it's probably not urgent [23:25:14] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [23:26:36] !log dereckson@tin Synchronized php-1.28.0-wmf.4/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.js: Simplify teardown of toolbar save button (T136421) (duration: 00m 23s) [23:26:37] T136421: [Regression wmf.4] Cannot open save dialog in a session which followed immediately after a successful save with VE; event not re-attached? - https://phabricator.wikimedia.org/T136421 [23:26:39] James_F: here you are for VE, please test ^ [23:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:50] Dereckson: I'm here, BTW [23:27:04] not sure if I missed the boat :) [23:27:46] Hi, we're still SWATting. [23:27:51] Dereckson: LGTM. [23:28:04] cool, I just have those 2 config changes [23:28:41] and 1 is Labs only [23:29:03] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292158 (https://phabricator.wikimedia.org/T125551) (owner: 10Kaldari) [23:29:04] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:29:09] James_F: thanks for testing [23:29:18] cool [23:29:21] this flake didn't page! [23:29:36] Dereckson: Thank you for deploying. :-) [23:29:44] You're welcome. 
[23:29:49] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service, 13Patch-For-Review: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2347636 (10Ladsgroup) a:05Ladsgroup>03None [23:30:05] robh: can you take a quick look at https://gerrit.wikimedia.org/r/#/c/292297/ when you have the chance? it should fix the flaky tools checker paging [23:30:22] Dereckson: just let me know when I should test [23:30:30] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service, 13Patch-For-Review: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2336803 (10Ladsgroup) Removing myself an the assignee since yuvi is working on it [23:30:51] YuviPanda: i can but the one time i added to that file it took me multiple failed attempts! [23:31:00] :D [23:31:02] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service, 13Patch-For-Review: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2347651 (10yuvipanda) a:03yuvipanda [23:31:09] kaldari: hey could you rebase against https://gerrit.wikimedia.org/r/#/c/281239/ so we avoid wmg/wg hack? [23:31:23] robh: I'll babysit it to make sure it's fine etc [23:31:44] looks sane to me though, I'll +1 [23:32:01] robh: ok! [23:32:12] (03CR) 10RobH: [C: 031] "looks sane to me (with the disclaimer my attempts to add to nagios/icinga checks took multiple attempts)" [puppet] - 10https://gerrit.wikimedia.org/r/292297 (https://phabricator.wikimedia.org/T136775) (owner: 10Yuvipanda) [23:32:52] Dereckson: awight: CentralNotice config update works well from here! 
[23:33:03] MatmaRex: ^ [23:33:27] (03PS2) 10Yuvipanda: icinga: Increase timeout for tools-checker checks [puppet] - 10https://gerrit.wikimedia.org/r/292297 (https://phabricator.wikimedia.org/T136775) [23:33:54] (03CR) 10Yuvipanda: [C: 032 V: 032] icinga: Increase timeout for tools-checker checks [puppet] - 10https://gerrit.wikimedia.org/r/292297 (https://phabricator.wikimedia.org/T136775) (owner: 10Yuvipanda) [23:34:05] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:16] Dereckson: that's not the issue MatmaRex mentioned... [23:34:41] (03PS2) 10Dereckson: Test PageAssessments extension on Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292158 (https://phabricator.wikimedia.org/T125551) (owner: 10Kaldari) [23:35:08] AndyRussG: oh you tested Use full URL in $wgNoticeHideUrls [23:35:10] * YuviPanda forces a run on neon [23:35:58] (03CR) 10Dereckson: [C: 032] "Rebased, SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292158 (https://phabricator.wikimedia.org/T125551) (owner: 10Kaldari) [23:36:05] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.885 second response time [23:36:05] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.874 second response time [23:36:18] !log Deploying RESTBase to xenon.eqiad.wmnet (canary node) [23:36:25] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:37] MatmaRex awight Derekson, yeah the CN problem for IE is still showing up on production on mediawiki.org [23:36:40] (03Merged) 10jenkins-bot: Test PageAssessments extension on Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292158 
(https://phabricator.wikimedia.org/T125551) (owner: 10Kaldari) [23:37:10] Huh I thought it was deployed [23:37:19] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Test PageAssessments extension on Labs (no-op) (duration: 00m 30s) [23:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:30] AndyRussG: it was supposed to be. ehhh [23:37:46] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2347667 (10Dzahn) Its puppetized now, should not be hard anymore. We already use it in prod. [23:37:57] Well that's the comment I saw on Phab [23:38:01] !log dereckson@tin Synchronized wmf-config/CommonSettings-labs.php: Test PageAssessments extension on Labs (no-op) (duration: 00m 26s) [23:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:36] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.023 second response time [23:38:56] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:12] !log RESTBase deploy to xenon.eqiad.wmnet (canary node) complete [23:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:27] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:37] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:46] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:57] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [23:40:16] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 3.037 second response time [23:40:27] (03PS2) 10Dereckson: Use extension registration for SpamBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281239 (https://phabricator.wikimedia.org/T119117) [23:40:41] (03PS6) 10Madhuvishy: uwsgi: Allow specifying plugins as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 [23:40:56] !log Deploying RESTBase to staging environment [23:41:03] * Dereckson adds it to SWAT. [23:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:06] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281239 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [23:41:07] Dereckson: how long does it usually take changes to the beta labs config to take effect? [23:41:19] kaldari: depends of Jenkins [23:41:20] AndyRussG: yeah, something's clearly fucked. 
i really don't have the ability to investigate what [23:41:30] but i'm sad that this kind of thing keeps happening [23:41:41] kaldari: 30 seconds to 5 minutes when the job isn't stuck [23:41:51] (03Merged) 10jenkins-bot: Use extension registration for SpamBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281239 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [23:41:55] kaldari: there is a Jenkins job picking automatically merged change [23:42:00] and deploying to labs [23:42:01] MatmaRex: I'm incredibly happy u spotted it, and that it hasn't gone out to enwiki or other WPs [23:42:07] (03CR) 10jenkins-bot: [V: 04-1] uwsgi: Allow specifying plugins as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 (owner: 10Madhuvishy) [23:42:45] Dereckson: not sure I understand your suggestions above: "hey could you rebase against https://gerrit.wikimedia.org/r/#/c/281239/ so we avoid wmg/wg hack?" [23:42:56] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.034 second response time [23:43:04] i semi-regularly encounter issues where different version of something is supposed to be deployed, is actually deployed, and is reported by special:version [23:43:07] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.330 second response time [23:43:11] Dereckson: can u tell me what commit for CentralNotice submodule you've been deploying, pls? For wmf.4? 
[23:43:17] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.048 second response time [23:43:25] kaldari: it's live on labs: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/5016/console [23:43:36] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.033 second response time [23:44:19] !log Deploy of RESTBase to staging environment complete [23:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:37] AndyRussG: none, but awight deployed https://gerrit.wikimedia.org/r/#/c/291118/ [23:46:09] AndyRussG: so nothing deployed since 2016-05-27 [23:46:11] Dereckson: hmm, doesn't seem to have had any effect: http://simple.wikipedia.beta.wmflabs.org/wiki/Special:Version. Do you know where I can check for errors for beta labs? [23:46:11] (03PS7) 10Madhuvishy: uwsgi: Allow specifying plugins as a uwsgi command line option [puppet] - 10https://gerrit.wikimedia.org/r/292030 [23:46:29] kaldari: you could ask on #wikimedia-releng for assistance [23:46:35] thanks [23:46:47] Dereckson: I mean, you're on tin syncing stuff, right? What commit sha do you see for the CentralNotice submodule on the box you're syncing from? 
[23:46:56] AndyRussG: 88137bbe7e098385e0711e0ba72fd326e4e395d9 [23:49:07] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:07] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:07] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:07] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:11] 06Operations, 13Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#2347689 (10Dzahn) >>! In T136735#2347455, @Krenair wrote: > Do we really want to allow more microsites? Personally i think it would be better to use wiki. But you gotta ask MBrent's team. This is j... [23:49:26] #wikimedia-relend is just botspam :/ [23:49:30] !log Deploying cdff5e3 to restbase1008.eqiad.wmnet (canary node) [23:49:33] releng* [23:49:37] (03PS2) 10Dzahn: add endowment.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/292208 (https://phabricator.wikimedia.org/T136735) [23:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:49:45] Dereckson: yeah, looks like it's throwing errors. I'll submit a patch to revert the labs config. [23:49:49] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Use extension registration for SpamBlacklist (T119117) (duration: 00m 24s) [23:49:50] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [23:49:52] kaldari: fine [23:49:56] Dereckson: so why isn't https://gerrit.wikimedia.org/r/#/c/292279/ on there? 
[23:50:34] !log Deploy of cdff5e3 to restbase1008.eqiad.wmnet (canary node), complete [23:50:37] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:50:37] (03PS1) 10Kaldari: Revert "Test PageAssessments extension on Labs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292300 [23:50:49] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: connect usb external disk to labmon1001 - https://phabricator.wikimedia.org/T136242#2347693 (10RobH) 05Resolved>03Open So the data is copying, but very slowly. At this rate, its over 1.8 days of time. While it will remain in rsync overnight, o... [23:50:58] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.323 second response time [23:51:06] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.624 second response time [23:51:06] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.628 second response time [23:51:07] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.615 second response time [23:51:09] Dereckson: Here's the revert: https://gerrit.wikimedia.org/r/#/c/292300/ , sorry for the hassle. 
[23:51:12] Dereckson: we need to update the CentralNotice submodule for wmf.4 and re-sync [23:51:30] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Revert Use extension registration for SpamBlacklist (T119117) (duration: 00m 24s) [23:51:31] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [23:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:51:40] PHP fatal error /srv/mediawiki/php-1.28.0-wmf.3/includes/GlobalFunctions.php line 115: [23:51:40] /srv/mediawiki/php-1.28.0-wmf.3/extensions/SpamBlackList/extension.json does not exist! [23:51:44] Sorry, I thought somehow it was deployed [23:51:49] vito fixed [23:51:50] already known or worth reporting? [23:51:56] great [23:52:03] Dereckson: when can we expect this change to be pushed? [23:52:14] right now [23:52:16] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:17] cool [23:52:32] enterprisey: this = 292300? 
[23:52:39] yes [23:52:39] (PS2) Dereckson: Revert "Test PageAssessments extension on Labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/292300 (owner: Kaldari) [23:54:03] enterprisey: hold on, I've a revert to commit first [23:54:13] oh, thanks [23:54:26] I was originally wondering about the SpamBlackList-related push [23:54:28] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:28] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:29] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:36] so, nevermind, I guess, regarding 292300 [23:55:08] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:55:14] hmm [23:55:18] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.018 second response time [23:55:19] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.024 second response time [23:55:26] (PS1) Dereckson: Revert "Use extension registration for SpamBlacklist" [mediawiki-config] - https://gerrit.wikimedia.org/r/292301 [23:55:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:55:49] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:55:54] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/292301 (owner: Dereckson) [23:56:34] (Merged) jenkins-bot: Revert "Use extension registration for SpamBlacklist" [mediawiki-config] - https://gerrit.wikimedia.org/r/292301 (owner: Dereckson)
[23:56:58] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [23:56:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:56:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:57:07] !log Deploying cdff5e3 to RESTBase production [23:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:33] Dereckson: hmm, there's a 5xx spike - do you think it might be related to a deployment? [23:57:35] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/292300 (owner: Kaldari) [23:57:42] (PS3) Dereckson: Revert "Test PageAssessments extension on Labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/292300 (owner: Kaldari) [23:57:52] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [23:58:01] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/292300 (owner: Kaldari) [23:58:23] Dereckson: eek, lowercase L :( [23:58:53] YuviPanda: yes, I merged a wrong extension upgrade patch, reverted [23:58:57] ah [23:58:59] ok [23:59:00] (Merged) jenkins-bot: Revert "Test PageAssessments extension on Labs" [mediawiki-config] - https://gerrit.wikimedia.org/r/292300 (owner: Kaldari)