
[00:01:37] Operations, Beta-Cluster-Infrastructure, Wikimedia-General-or-Unknown: Beta English Wikipedia: History of the page 'Bird' generates a 500 or 503 error - https://phabricator.wikimedia.org/T185969#3929818 (Paladox) I can confirm that it shows a 503 when logged in.
[00:05:38] Operations, ops-codfw, netops: rack spare switches in c1-codfw - https://phabricator.wikimedia.org/T185336#3939709 (ayounsi) Open>Resolved OS upgraded to 14.1X53-D43.7. No system alarms. Configuration zeroized.
[00:06:05] jouncebot: refresh
[00:06:08] I refreshed my knowledge about deployments.
[00:06:44] Operations, Analytics-Data-Quality, Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3939712 (Dzahn) p:Triage>Normal @Tbayer purely from a ticket triaging perspective: since the ticket title is "vet reliability of the...
[00:09:39] Operations, ops-eqiad, Analytics: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3939716 (RobH) a:RobH>elukey
[00:10:40] too many open tasks misassigning them between people =P
[00:10:41] Operations, ops-eqiad, Analytics: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3867856 (RobH) a:elukey>Gehel These are finishing their initial puppet runs and are ready to be pushed into service role. Escalating to @elukey.
[00:10:44] Operations, ops-eqiad, Analytics: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3939722 (RobH) a:Gehel>elukey
[00:15:14] Operations, ops-esams, netops: replace msw1-esams - https://phabricator.wikimedia.org/T185151#3939723 (Dzahn) p:Triage>Normal
[00:17:07] (CR) Dzahn: [C: 1] "+1 in the literal "needs somebody else to approve" sense.
tiny nitpick there is a typo in the comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/399101 (owner: Ayounsi)
[00:19:45] Operations, ops-esams, Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061#3871139 (ayounsi) Mentioning T185151 here as well so it's on your radar (doesn't have to be Q3).
[00:29:18] (PS6) Madhuvishy: NFS: add custom script to generate target hosts [puppet] - https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: Volans)
[00:29:46] (CR) jerkins-bot: [V: -1] NFS: add custom script to generate target hosts [puppet] - https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: Volans)
[00:30:49] (PS7) Madhuvishy: NFS: add custom script to generate target hosts [puppet] - https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: Volans)
[00:34:29] Operations, Analytics-Kanban, User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3939747 (Dzahn) ``` .. Fri Feb 2 00:23:13 2018 - INFO: - device disk/1: 99.30% done, 21s remaining (estimated) Fri Feb 2 00:23:34 2018 - INFO: - device disk/1: 100.00%...
[00:38:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[00:39:03] Operations, Analytics-Kanban, User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3939748 (Dzahn) @elukey so yea, now we'd have to restart the instance from ganeti, as the comment above says rebooting from within the instance won't do it. You said above...
[00:40:35] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[00:45:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 11 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[01:13:25] (PS1) Yuvipanda: Remove access for myself [puppet] - https://gerrit.wikimedia.org/r/407577
[01:13:30] Operations: upgrade all Ubuntu (trusty) hosts in production - https://phabricator.wikimedia.org/T186288#3939825 (Dzahn)
[01:20:46] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (Dzahn)
[01:23:14] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939859 (Dzahn)
[01:35:10] (Abandoned) Yuvipanda: labs: Only include nfsclient if *any* nfs mounts are enabled [puppet] - https://gerrit.wikimedia.org/r/333227 (owner: Yuvipanda)
[01:42:31] Operations, cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939867 (bd808)
[01:43:24] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[01:44:43] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:48:33] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms
[01:49:53] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 3.17 ms
[02:39:51] (CR) Legoktm: ":(" [puppet] - https://gerrit.wikimedia.org/r/407577 (owner: Yuvipanda)
[02:49:58] Operations, Analytics-Data-Quality, Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3940039 (BBlack) @Dzahn I was planning to follow up a bit on some of the remaining
questions above, just haven't gotten there yet :)
[03:24:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 831.32 seconds
[03:52:43] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 198.12 seconds
[04:06:54] Operations, Analytics, Research, Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3940150 (Tgr) @Nuria any thoughts about the next step? Should we just enable the fallback and check the data to see if it had any unexpe...
[04:30:23] PROBLEM - HHVM rendering on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:31:13] PROBLEM - SSH cp3038.mgmt on cp3038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:31:17] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 79949 bytes in 0.294 second response time
[05:00:43] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:01:03] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:07] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:36] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:43] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:43] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:44] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[05:02:53] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:03:13] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:03:23] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:03:23] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:03:54] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:04:23] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:05:13] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:06:26] puppetdb on nitrogen killed by oom --^
[05:06:29] [Fri Feb 2 04:57:52 2018] Killed process 20908 (java) total-vm:12292464kB, anon-rss:6269300kB, file-rss:0kB, shmem-rss:0kB
[05:07:23] elukey: o/
[05:07:51] hello ema! <3
[05:11:13] elukey: anything to do when puppetdb gets shot, or does it recover on its own?
[05:12:36] ema: systemd restarts it and the next puppet run recovers automagically
[05:13:32] ok!
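Editor's note: the kernel line pasted at 05:06:29 can be read mechanically. What follows is a minimal sketch, assuming the line keeps exactly the shape quoted above; the names `OOM_PATTERN` and `parse_oom_line` are hypothetical and not part of any Wikimedia tooling. It extracts the killed PID, the command name, and the memory figures, converting kB to GiB.

```python
import re

# Kernel OOM-killer line as quoted in the log at 05:06:29.
OOM_LINE = (
    "[Fri Feb 2 04:57:52 2018] Killed process 20908 (java) "
    "total-vm:12292464kB, anon-rss:6269300kB, file-rss:0kB, shmem-rss:0kB"
)

# Hypothetical pattern; it assumes the field order shown in the quoted line.
OOM_PATTERN = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_line(line):
    """Return pid, command name and memory sizes in GiB, or None if no match."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    kib_per_gib = 1024 * 1024  # the kernel reports these sizes in kB (KiB)
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "total_vm_gib": int(match.group("total_vm")) / kib_per_gib,
        "anon_rss_gib": int(match.group("anon_rss")) / kib_per_gib,
    }


info = parse_oom_line(OOM_LINE)
print(info)
```

Read this way, the java process (puppetdb, per the 05:06:26 message) held a little under 6 GiB of anonymous resident memory when it was killed.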
[05:28:54] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[05:29:23] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:30:13] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:30:43] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:31:03] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:31:04] RECOVERY - SSH cp3038.mgmt on cp3038.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[05:32:03] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:32:33] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:32:43] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:32:43] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:32:44] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:32:53] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:33:14] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:33:23] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:33:23] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:37:17] !log truncate /var/log/aphlict/aphlict.log to 25G as temp measure to avoid phab1001's
root partition to fill up
[05:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:44:10] (PS1) Elukey: phabricator: fix aphlict's logrotate config [puppet] - https://gerrit.wikimedia.org/r/407583
[05:57:05] (CR) Elukey: [C: 2] phabricator: fix aphlict's logrotate config [puppet] - https://gerrit.wikimedia.org/r/407583 (owner: Elukey)
[05:58:54] for some reason pcc didn't show any diff for --^
[06:10:35] (PS2) Dzahn: DHCP: Add MAC address entry for tendril2001 [puppet] - https://gerrit.wikimedia.org/r/407457 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[06:13:16] (CR) Dzahn: [C: 2] "racadm getsysinfo" [puppet] - https://gerrit.wikimedia.org/r/407457 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[06:13:23] (PS3) Dzahn: DHCP: Add MAC address entry for tendril2001 [puppet] - https://gerrit.wikimedia.org/r/407457 (https://phabricator.wikimedia.org/T186123) (owner: Papaul)
[06:16:56] (CR) Dzahn: "thank you for this :) +20after4 +paladox" [puppet] - https://gerrit.wikimedia.org/r/407583 (owner: Elukey)
[06:25:00] /away
[06:27:43] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time
[06:28:43] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.009 second response time
[06:52:07] Operations, Phabricator, Release-Engineering-Team (Kanban), User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3940279 (elukey) I picked one of the httpd processes from server-status in...
[06:53:44] (CR) Elukey: "Can I just truncate the file to 1G or does it contain useful things?
Now it is 25G :)" [puppet] - https://gerrit.wikimedia.org/r/407583 (owner: Elukey)
[07:03:48] (PS1) Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807)
[07:08:09] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:09:50] (Merged) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:10:00] (CR) jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:10:26] !log Fixing data drifts on db1065 - T162807
[07:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:41] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[07:11:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T162807 (duration: 00m 55s)
[07:11:18] Operations, Analytics-Kanban, User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940290 (elukey) >>! In T186020#3939748, @Dzahn wrote: > @elukey so yea, now we'd have to restart the instance from ganeti, as the comment above says rebooting from within...
[07:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:47] (PS1) Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807)
[07:35:51] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:37:27] (PS1) Elukey: README.md: add a note about glibc depencency [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169)
[07:38:03] (Merged) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:38:13] (CR) jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:39:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T162807 (duration: 00m 55s)
[07:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:28] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[07:50:46] (CR) Muehlenhoff: README.md: add a note about glibc depencency (1 comment) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: Elukey)
[07:54:29] (CR) Elukey: README.md: add a note about glibc depencency (1 comment) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: Elukey)
[07:55:32] (PS2) Elukey: README.md: add a note about glibc depencency [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169)
[07:59:39] (CR) Muehlenhoff: [C:
1] README.md: add a note about glibc depencency [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: Elukey)
[08:10:51] (PS1) Ema: cache_upload: upgrade cp4021 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407594 (https://phabricator.wikimedia.org/T180433)
[08:12:37] (CR) Ema: [C: 2] cache_upload: upgrade cp4021 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407594 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[08:14:00] !log cache_upload: upgrade cp4021 to varnish 5
[08:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:15] Operations, Phabricator, Release-Engineering-Team (Kanban), User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3940344 (Paladox) @elukey thank you for looking into this :). Is our next s...
[08:21:06] !log cache_upload: repool cp4021 (varnish 5)
[08:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:01] !log installing curl security updates on trusty (Debian already updated)
[08:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:13] !log Stop replication in sync db1089 - db1065 - T162807
[08:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:27] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[08:27:56] (PS1) Ema: cache_upload: upgrade cp4022 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407598 (https://phabricator.wikimedia.org/T180433)
[08:29:11] (CR) Ema: [C: 2] cache_upload: upgrade cp4022 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407598 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[08:29:49] !log cache_upload: upgrade cp4022 to varnish 5
[08:30:02] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:24] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3127: Connection refused
[08:33:43] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3124: Connection refused
[08:33:44] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 80: Connection refused
[08:33:44] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3125: Connection refused
[08:33:56] that's me, the host is depooled ^
[08:34:24] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.157 second response time
[08:34:43] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.157 second response time
[08:34:44] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.160 second response time
[08:34:44] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.157 second response time
[08:35:49] !log cache_upload: repool cp4022 (varnish 5)
[08:35:58] (PS1) Marostegui: db-eqiad.php: Repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807)
[08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:57] !log apt-get install php5-dbg on phab1001 as attempt to have a better gdb output for T182832
[08:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:12] T182832: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832
[08:39:52] (CR) Marostegui: [C:
2] db-eqiad.php: Repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:41:05] nope, I'd probably need libapache2-mod-php5's dbg symbols too
[08:41:55] mmm maybe not, only the ones for the json lib?
[08:42:05] (Merged) jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:42:18] (CR) jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[08:43:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T162807 (duration: 00m 55s)
[08:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:01] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[08:44:42] (PS1) Ema: cache_upload: upgrade cp4023 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407600 (https://phabricator.wikimedia.org/T180433)
[08:52:32] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/407601
[08:52:35] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/407601
[08:58:37] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/407601 (owner: Marostegui)
[09:00:17] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/407601 (owner: Marostegui)
[09:00:30] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - https://gerrit.wikimedia.org/r/407601 (owner: Marostegui)
[09:01:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T162807 (duration: 00m 54s)
[09:02:03] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:03] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:06:24] (CR) Ema: [C: 2] cache_upload: upgrade cp4023 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407600 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[09:08:17] !log cache_upload: upgrade cp4023 to varnish 5
[09:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:26] !log cache_upload: repool cp4023 (varnish 5)
[09:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:10] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940425 (jcrespo) @Marostegui- this is bad, codfw machine was created as tendril2001 T186123, and this was called db1115. This is not a terrible name for...
[09:18:49] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940433 (Marostegui) >>! In T185788#3940425, @jcrespo wrote: > @Marostegui- this is bad, codfw machine was created as tendril2001 T186123, and this was ca...
[09:21:34] (PS1) Marostegui: install_server: Replace db1115 with tendril1001 [puppet] - https://gerrit.wikimedia.org/r/407605 (https://phabricator.wikimedia.org/T185788)
[09:24:16] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup tendril1001 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940445 (Marostegui) >>! In T185788#3940441, @jcrespo wrote: >> Suggestions? > > The easy thing would be call it tendril1001 (which I do not 100% li...
[09:26:51] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup tendril1001 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940447 (jcrespo) > To be honest, I would call it as a normal database name, to avoid making any kind of exception and having dedicated hostnames Wh...
[09:27:46] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup tendril1001 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940448 (Marostegui) >>! In T185788#3940447, @jcrespo wrote: >> To be honest, I would call it as a normal database name, to avoid making any kind of...
[09:36:01] (PS1) Gilles: Give officewiki read access to Thumbor [puppet] - https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144)
[09:36:33] (CR) jerkins-bot: [V: -1] Give officewiki read access to Thumbor [puppet] - https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[09:36:40] (Abandoned) Marostegui: install_server: Replace db1115 with tendril1001 [puppet] - https://gerrit.wikimedia.org/r/407605 (https://phabricator.wikimedia.org/T185788) (owner: Marostegui)
[09:37:11] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940476 (Marostegui)
[09:38:12] Operations, Patch-For-Review, User-fgiunchedi: provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#3940478 (fgiunchedi)
[09:41:10] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940489 (jcrespo) Note for that one means involving papaul and renaming stuff, from the physical label to racktables, to dns, etc.
[09:41:14] (PS1) Marostegui: install_server: Change db.cfg with raid1-gpt.cfg [puppet] - https://gerrit.wikimedia.org/r/407610 (https://phabricator.wikimedia.org/T185788)
[09:41:37] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940491 (Marostegui) >>! In T185788#3940489, @jcrespo wrote: > Note for that one means involving papaul and renaming stuff, from the physical label to rac...
[09:43:09] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940492 (jcrespo) All db hosts will have a hw RAID except these, it will be confusing.
[09:44:41] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940493 (Marostegui) >>! In T185788#3940492, @jcrespo wrote: > All db hosts will have a hw RAID except these, it will be confusing. ok - I am going to st...
[09:46:35] !log Add thumborUrl to Swift config in PrivateSettings.php
[09:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:07] (PS1) Gilles: Proxy public wiki thumb.php wikis through Thumbor [mediawiki-config] - https://gerrit.wikimedia.org/r/407611 (https://phabricator.wikimedia.org/T169144)
[09:50:51] (PS2) Gilles: Proxy public wiki thumb.php requests through Thumbor [mediawiki-config] - https://gerrit.wikimedia.org/r/407611 (https://phabricator.wikimedia.org/T169144)
[09:56:34] (PS2) Gilles: Give officewiki read access to Thumbor [puppet] - https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144)
[09:57:01] (CR) jerkins-bot: [V: -1] Give officewiki read access to Thumbor [puppet] - https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[09:57:08] !log roll-upgrade thumbor to 1.11 - T178072 T185478 T185483 T185485 T183907 T179954
[09:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:26] T185483: Measure time spent by Thumbor connecting to and reading from Memcache - https://phabricator.wikimedia.org/T185483
[09:57:26] T185478: Measure time spent by Thumbor connecting to and reading from Swift - https://phabricator.wikimedia.org/T185478
[09:57:26] T185485: Measure time spent by Thumbor connecting to and reading from Poolcounter - https://phabricator.wikimedia.org/T185485
[09:57:26] T179954: Thumbor errors should contain a trackable request id - https://phabricator.wikimedia.org/T179954
[09:57:26] T183907: Thumbor 500 while thumbnailing some webm files - https://phabricator.wikimedia.org/T183907
[09:57:26] T178072: Thumbor: Error reading image metadata: Failed to read image data - https://phabricator.wikimedia.org/T178072
[09:57:33] thanks stashbot
[09:58:42] (CR) Gilles: "The linting error seems incorrect..."
[puppet] - https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: Gilles)
[10:00:14] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940555 (Marostegui) Open>stalled
[10:18:28] !log installing ruby security updates on trusty
[10:18:35] (PS1) Ema: cache_upload: upgrade cp4024 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407616 (https://phabricator.wikimedia.org/T180433)
[10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:28] (CR) Ema: [C: 2] cache_upload: upgrade cp4024 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407616 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[10:20:03] !log cache_upload: upgrade cp4024 to varnish 5
[10:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:25] (CR) Thiemo Kreuz (WMDE): [C: 1] "This got a "Go!" from PM already, see T186107."
[mediawiki-config] - https://gerrit.wikimedia.org/r/407017 (https://phabricator.wikimedia.org/T186107) (owner: Zoranzoki21)
[10:24:37] !log cache_upload: repool cp4024 (varnish 5)
[10:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:50] (PS2) Marostegui: install_server: Change db.cfg with raid1-gpt.cfg [puppet] - https://gerrit.wikimedia.org/r/407610 (https://phabricator.wikimedia.org/T185788)
[10:35:31] (PS1) Ema: cache_upload: upgrade cp4025 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407620 (https://phabricator.wikimedia.org/T180433)
[10:35:33] (PS1) Ema: cache_upload: upgrade ulsfo to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407621 (https://phabricator.wikimedia.org/T180433)
[10:35:35] (CR) Marostegui: [C: 2] install_server: Change db.cfg with raid1-gpt.cfg [puppet] - https://gerrit.wikimedia.org/r/407610 (https://phabricator.wikimedia.org/T185788) (owner: Marostegui)
[10:37:18] (PS1) Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/407622 (https://phabricator.wikimedia.org/T162807)
[10:37:50] (PS2) Ema: cache_upload: upgrade cp4025 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407620 (https://phabricator.wikimedia.org/T180433)
[10:38:16] (CR) Ema: [V: 2 C: 2] cache_upload: upgrade cp4025 to varnish 5 [puppet] - https://gerrit.wikimedia.org/r/407620 (https://phabricator.wikimedia.org/T180433) (owner: Ema)
[10:39:05] !log cache_upload: upgrade cp4025 to varnish 5
[10:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:19] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/407622 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[10:40:46] (Merged) jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/407622
(https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [10:40:57] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407622 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [10:41:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 - T162807 (duration: 00m 55s) [10:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:09] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [10:43:46] !log cache_upload: repool cp4025 (varnish 5) [10:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:27] (03CR) 10Filippo Giunchedi: "> The linting error seems incorrect..." [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [11:06:31] (03PS2) 10Ema: cache_upload: upgrade ulsfo to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407621 (https://phabricator.wikimedia.org/T180433) [11:07:02] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade ulsfo to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407621 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [11:07:59] !log cache_upload: upgrade cp4026 to varnish 5 [11:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:28] !log cache_upload: repool cp4026 (varnish 5) [11:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:53] 10Operations, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review: php-luasandbox in Wikimedia's Stretch apt repo depends on php5 - https://phabricator.wikimedia.org/T183888#3940663 (10MoritzMuehlenhoff) I've uploaded a backport of Kunal's 1.5.1-3 package from Debian testing to stretc... 
[11:21:30] (03CR) 10Muehlenhoff: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [11:27:04] 10Operations, 10Packaging: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3940702 (10Legoktm) 05Open>03Resolved Both wikidiff2 and luasandbox are now in stretch-backports. [11:27:07] 10Operations, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review: php-luasandbox in Wikimedia's Stretch apt repo depends on php5 - https://phabricator.wikimedia.org/T183888#3940705 (10Legoktm) [11:30:07] 10Operations, 10Packaging: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3940712 (10MoritzMuehlenhoff) In addition I'll drop the php-wikidiff2 from our internal src:php-wikidiff2 package (so that it only builds hhvm-wikidiff2). [11:31:53] PROBLEM - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [11:32:50] ^ that is known I will ack it [11:32:58] and downtime it [11:33:39] ACKNOWLEDGEMENT - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Marostegui T186049 - The acknowledgement expires at: 2018-02-09 11:33:09. 
[11:38:19] 10Operations, 10wikitech.wikimedia.org: Remove cloud-admin rights from YuviPanda - https://phabricator.wikimedia.org/T186289#3940719 (10MarcoAurelio) [11:39:58] !log uploaded php-wikidiff2 1.5.1+deb9u2 to apt.wikimedia.org (despite the source package name, this package only builds hhvm-wikidiff2 now as php-wikidiff2 is instead updated via stretch-backports, the old internal package will eventually be phased out when we move to PHP7) [11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:44] 10Operations, 10wikitech.wikimedia.org: Remove cloud-admin rights from YuviPanda - https://phabricator.wikimedia.org/T186289#3939839 (10MarcoAurelio) As for wikitech, any user from https://wikitech.wikimedia.org/wiki/Special:ListUsers?group=cloudadmin can do that. [11:41:53] RECOVERY - MegaRAID on db1051 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [11:42:07] how is that possible? [11:43:55] I did a re-learn [11:43:57] (03PS2) 10Filippo Giunchedi: Add Thumbor-Request-Id generated by nginx [puppet] - 10https://gerrit.wikimedia.org/r/407411 (https://phabricator.wikimedia.org/T179954) (owner: 10Gilles) [11:44:14] based on past experience, it may fail again soon :-/ [11:44:19] it will [11:46:31] (03CR) 10Joal: [C: 031] "Looks good to me, one question inline :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [11:47:07] (03CR) 10Filippo Giunchedi: [C: 032] Add Thumbor-Request-Id generated by nginx [puppet] - 10https://gerrit.wikimedia.org/r/407411 (https://phabricator.wikimedia.org/T179954) (owner: 10Gilles) [11:52:14] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940724 (10MoritzMuehlenhoff) mailman3-core, mailman3-hyperkitty, postorius and mailmanclient have been accepted into stretch-backports today. 
[11:57:24] !log roll-restart nginx on thumbor and swift-proxy on ms-fe to apply https://gerrit.wikimedia.org/r/407411 [11:57:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 [11:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:25] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940731 (10MarcoAurelio) Does that mean we can start considering our migration? [12:05:15] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 (owner: 10Marostegui) [12:05:31] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940735 (10MoritzMuehlenhoff) That's dependent on goal planning / road map considerations, I only meant to point out the availability in backports sinc... [12:06:56] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 (owner: 10Marostegui) [12:07:15] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 (owner: 10Marostegui) [12:08:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T162807 (duration: 00m 55s) [12:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:14] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [12:11:02] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940741 (10Elitre) >>! 
In T52864#3940724, @MoritzMuehlenhoff wrote: > mailman3-core, mailman3-hyperkitty, postorius and mailmanclient have been acce... [12:15:06] 10Operations, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review: php-luasandbox in Wikimedia's Stretch apt repo depends on php5 - https://phabricator.wikimedia.org/T183888#3940749 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Our internal wikidiff2 package has... [12:31:02] (03CR) 10Muehlenhoff: [C: 04-1] Remove access for myself (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/407577 (owner: 10Yuvipanda) [12:34:20] (03CR) 10Muehlenhoff: [C: 04-1] "And BTW, you're also listed in modules/nagios_common/files/contactgroups.cfg; do you want to be kept in the sms contact group? And" [puppet] - 10https://gerrit.wikimedia.org/r/407577 (owner: 10Yuvipanda) [12:37:54] and now db1070 [12:55:43] PROBLEM - MegaRAID on db1070 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [12:55:44] ACKNOWLEDGEMENT - MegaRAID on db1070 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T186319 [12:55:48] 10Operations, 10ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940831 (10ops-monitoring-bot) [12:56:28] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940835 (10Marostegui) This is s5 master [12:56:56] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940837 (10Marostegui) p:05Triage>03High [12:57:23] (03CR) 10Elukey: profile::analytics::refinery::job::camus: add netflow hourly job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [12:57:42] !log installing updated kernels on remaining jessie DB servers [12:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:13] 10Operations, 10ops-eqiad, 
10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940831 (10Marostegui) a:03Cmjohnson @Cmjohnson this host is out of warranty. Can we replace its disk as soon as possible - if possible before the weekend comes? [13:08:22] (03PS7) 10Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) [13:09:50] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3940856 (10MoritzMuehlenhoff) >>! In T172487#3902655, @Dzahn wrote: > No, no point in debugging indeed. Instead it would be really nice if it could be shut down after running such a long time doing nothing... [13:16:14] !log installing w3m security updates on trusty [13:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:08] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3940874 (10Paladox) Is it possible that phab can't keep up with everyone conn... [13:20:00] (03PS8) 10Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) [13:44:33] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940931 (10elukey) Is it a simple gnt-instance reboot meitnerium.wikimedia.org right? 
[13:48:16] (03PS1) 10Marostegui: db1100: Swtiching it to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) [13:49:41] (03CR) 10Jcrespo: [C: 031] db1100: Swtiching it to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [13:49:49] (03PS1) 10Marostegui: db-eqiad.php: Clarifying that db1100 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) [13:50:44] (03CR) 10Jcrespo: [C: 031] "Will we need to depool it for a restart and/or hot reconfig?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [13:51:47] (03CR) 10Marostegui: "> Will we need to depool it for a restart and/or hot reconfig?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [13:53:49] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3940941 (10Dzahn) Yes, i can't do them though because i don't have the access to disable switch ports. 
[13:55:26] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940943 (10Dzahn) Yea, or gnt-instance shutdown gnt-instance startup [13:59:10] (03PS1) 10Muehlenhoff: Add library hint for librdkafka [puppet] - 10https://gerrit.wikimedia.org/r/407636 [13:59:48] !log reboot meitnerium via gnt-instance reboot on ganeti1005 to pick up new disk config - T184794 [14:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:00] T184794: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 [14:00:25] course the task is T186020 not this one [14:00:26] T186020: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020 [14:00:39] (03CR) 10Jcrespo: [C: 031] "There is a typo on the title- it should also be changed per style guide to "...:Switch it ..."" [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [14:00:44] (03CR) 10Dzahn: "decom steps says it needs to happen after switch ports are disabled and i can't disable switch ports, so i'll have to leave this to Rob to" [dns] - 10https://gerrit.wikimedia.org/r/407173 (owner: 10Papaul) [14:01:24] (03PS2) 10Dzahn: DNS: Add production DNS entry for tendril2001 [dns] - 10https://gerrit.wikimedia.org/r/407454 (owner: 10Papaul) [14:01:27] (03CR) 10Jcrespo: [C: 031] "...: Clarify db1100..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [14:03:16] jynus: is dbmonitor2001 not the same as tendril2001 ? 
[14:04:41] (03CR) 10Muehlenhoff: [C: 032] Add library hint for librdkafka [puppet] - 10https://gerrit.wikimedia.org/r/407636 (owner: 10Muehlenhoff) [14:05:46] mutante: one is the frontend, and another is the backend [14:06:05] jynus: ah !:) [14:06:16] but there is no "tendril1001" [14:06:33] well, it is not clear right now if it will be [14:06:46] will that happen later when db1011 will be replaced? [14:06:55] what I think is we shouldn't have different names for the same thing on different dcs [14:07:16] we can have if we want different names for backend and frontend, if we want [14:07:28] is tendril2001 the equivalent of db1011 and will use role(mariadb::tendril) ? [14:07:41] yes [14:07:44] ok, gotcha [14:08:00] i guess it means we should at some point rename that [14:08:20] the thing is we either rename tendril to db, or db to tendril, or both to something else [14:08:51] tendril1001/db1115 has not yet been set up, so better decide now [14:08:56] i can still just _not_ merge that change above that adds tendril2001 [14:08:56] tendril is not a good name [14:09:02] so far tendril is only a service name [14:09:16] because we may rename tendril service to other thing [14:09:30] that is why I called the frontends dbmonitor [14:09:33] more generic [14:09:57] dbmonitor-be ? [14:09:57] these should be dbmonitordb's, but that is a terrible name [14:10:12] -be is like we do for swift [14:10:30] marostegui may also be mad at me for confusing him [14:10:46] :'-( [14:11:22] renaming dbmonitors also is very easy because they are vms [14:12:08] what do the rest think, is dbmonitor-fe and dbmonitor-be clear? [14:12:41] dbmondb [14:12:47] lol [14:12:58] yes, -be and -fe follows the same pattern we use for others at least [14:13:20] however, we do not do that for other dbs [14:13:28] misc dbs are still dbs [14:13:35] (03CR) 10Dzahn: [C: 04-1] "talked with jynus. we should find a different name for these machines. 
tendril may be replaced by other software and isn't a good host nam" [dns] - 10https://gerrit.wikimedia.org/r/407454 (owner: 10Papaul) [14:14:04] yea, the other option is of course just to give it a db number like all before [14:14:07] up to you guys [14:14:20] but it seems clear we should not use tendril2001 then [14:14:34] those db hosts may be also used to host a small, private installation of prometheus [14:14:42] for private queries [14:15:41] faidon had some issues with so many db* [14:16:05] maybe it would be a good time to use db* only for mediawiki metadata databases [14:16:16] and set up miscdb* hosts [14:16:52] although another thing is that we change those sometimes [14:17:10] while new tendril hosts have specific different hardware [14:20:14] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [14:20:54] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3934359 (10Dzahn) After talking with jcrespo on IRC: We should use a different name for this system. So far tendril is only a service name, not a host name. And tendril might be... 
[14:21:03] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:03] PROBLEM - Host ununpentium is DOWN: PING CRITICAL - Packet loss = 100% [14:21:13] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:13] PROBLEM - Host kubestagetcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:23] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:54] I was copying data from /var/lib/archiva on meitnerium to another mount point, the host froze so I guess that this mess is due to me [14:22:13] PROBLEM - ganeti-noded running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [14:22:46] I didn't really expect this issue [14:23:04] RECOVERY - ganeti-noded running on ganeti1005 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [14:23:38] it's the same error we'd been seeing before [14:23:52] https://phabricator.wikimedia.org/T181121 [14:24:07] moritzm: Thanks for https://phabricator.wikimedia.org/T52864#3940724 :-) That's interesting news. [14:24:34] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 644969 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:11] moritzm: ganeti1005 works fine, no idea how to restore the instances.. 
[14:25:27] mhh, this might be somewhat different, though, there's various OOM killer logs for qemu processes, don't think we've seen that before in T181121 [14:25:28] T181121: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121 [14:25:33] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 559363 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:34] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5671 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:43] RECOVERY - Host ununpentium is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [14:25:43] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [14:25:47] ah ok [14:25:53] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [14:25:53] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [14:25:55] yeah, they should recover soonish [14:26:02] so cpu and load went up when I started the cp [14:26:05] RECOVERY - Host kubestagetcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [14:26:05] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [14:26:13] RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [14:26:13] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [14:26:23] I'll poke at logs and compare to what we collected at T181121 [14:26:33] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 3828 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:28:41] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940974 (10elukey) New disk in place, added ext4 and everything looks good. I mounted /dev/vdb1 to /mnt/archiva and started a cp -a from /var/lib/archiva to that dir, but ga... 
[14:29:14] I am going to stop for today working on that host :( [14:35:22] (03PS2) 10Arturo Borrero Gonzalez: WIP: apt: merge script report-pending-upgrades to apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [14:35:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: apt: merge script report-pending-upgrades to apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:40:22] (03CR) 10Elukey: [C: 032] README.md: add a note about glibc depencency [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: 10Elukey) [14:40:30] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940981 (10akosiaris) How nice :(. But it does look like disk IO is a possible reproduction scenario for T181121. I 'll empty ganeti1005 to avoid having any worse problems d... [14:40:33] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 4683890 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:41:06] akosiaris: sorry for the trouble :( [14:41:33] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 3703 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:42:25] 10Operations, 10ops-eqiad: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121#3940985 (10akosiaris) During some disk IO in T186020, ganeti1005 exhibited the usual symptoms. This hasn't been triggered for ~1 month so maybe we have a reproduction scenario in some heavy IO ? An... [14:43:07] 10Operations, 10ops-eqiad: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121#3940987 (10MoritzMuehlenhoff) Happened again on ganeti1005, similar errors, but this time triggered by a copy of the Archiva data. 
[14:51:52] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is a git submodule (there are more, listed at https://phabricator.wikimedia.org/source/operations-puppet/browse/production/.gitmodule" [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) (owner: 10Herron) [14:53:04] !log reboot ganeti1005 after emptying it. T181121 [14:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] T181121: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121 [14:54:23] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:55:03] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:58:20] (03CR) 10Muehlenhoff: "But it would be great to simply fold nginx into the main operations puppet.git, having that as a separate sub module is probably entirely " [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) (owner: 10Herron) [15:02:52] (03CR) 10Alexandros Kosiaris: [C: 04-2] "FWIW, I think so too. We 've had this discussion multiple times and I think we agree that we haven't really seen any discernible advantage" [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) (owner: 10Herron) [15:06:45]