[00:01:37] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-General-or-Unknown: Beta English Wikipedia: History of the page 'Bird' generates a 500 or 503 error - https://phabricator.wikimedia.org/T185969#3929818 (10Paladox) I can confirm that it shows a 503 when logged in. [00:05:38] 10Operations, 10ops-codfw, 10netops: rack spare switches in c1-codfw - https://phabricator.wikimedia.org/T185336#3939709 (10ayounsi) 05Open>03Resolved OS upgraded to 14.1X53-D43.7. No system alarms. Configuration zeroized. [00:06:05] jouncebot: refresh [00:06:08] I refreshed my knowledge about deployments. [00:06:44] 10Operations, 10Analytics-Data-Quality, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3939712 (10Dzahn) p:05Triage>03Normal @Tbayer purely from a ticket triaging perspective: since the ticket title is "vet reliability of the... [00:09:39] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3939716 (10RobH) a:05RobH>03elukey [00:10:40] too many open tasks misasisgning them between people =P [00:10:41] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3867856 (10RobH) a:05elukey>03Gehel These are finishing their initial puppet runs and are ready to be pushed into service role. Escalating to @elukey. [00:10:44] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3939722 (10RobH) a:05Gehel>03elukey [00:15:14] 10Operations, 10ops-esams, 10netops: replace msw1-esams - https://phabricator.wikimedia.org/T185151#3939723 (10Dzahn) p:05Triage>03Normal [00:17:07] (03CR) 10Dzahn: [C: 031] "+1 in the literal "needs somebody else to approve" sense. 
tiny nitpick there is a typo in the comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399101 (owner: 10Ayounsi) [00:19:45] 10Operations, 10ops-esams, 10Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061#3871139 (10ayounsi) Mentioning T185151 here as well so it's on your radar (doesn't have to be Q3). [00:29:18] (03PS6) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [00:29:46] (03CR) 10jerkins-bot: [V: 04-1] NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [00:30:49] (03PS7) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [00:34:29] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3939747 (10Dzahn) ``` .. Fri Feb 2 00:23:13 2018 - INFO: - device disk/1: 99.30% done, 21s remaining (estimated) Fri Feb 2 00:23:34 2018 - INFO: - device disk/1: 100.00%... [00:38:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [00:39:03] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3939748 (10Dzahn) @elukey so yea, now we'd have to restart the instance from ganeti, as the comment above says rebooting from within the instance won't do it. You said above... 
[00:40:35] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [00:45:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 11 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:13:25] (03PS1) 10Yuvipanda: Remove access for myself [puppet] - 10https://gerrit.wikimedia.org/r/407577 [01:13:30] 10Operations: upgrade all Ubuntu (trusty) hosts in production - https://phabricator.wikimedia.org/T186288#3939825 (10Dzahn) [01:20:46] 10Operations, 10cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939825 (10Dzahn) [01:23:14] 10Operations, 10cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939859 (10Dzahn) [01:35:10] (03Abandoned) 10Yuvipanda: labs: Only include nfsclient if *any* nfs mounts are enabled [puppet] - 10https://gerrit.wikimedia.org/r/333227 (owner: 10Yuvipanda) [01:42:31] 10Operations, 10cloud-services-team: replace all Ubuntu (trusty) hosts in production with Debian - https://phabricator.wikimedia.org/T186288#3939867 (10bd808) [01:43:24] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [01:44:43] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:33] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms [01:49:53] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 3.17 ms [02:39:51] (03CR) 10Legoktm: ":(" [puppet] - 10https://gerrit.wikimedia.org/r/407577 (owner: 10Yuvipanda) [02:49:58] 10Operations, 10Analytics-Data-Quality, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3940039 (10BBlack) @Dzahn I was planning to follow up a bit on some of the remaining 
questions above, just haven't gotten there yet :) [03:24:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 831.32 seconds [03:52:43] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 198.12 seconds [04:06:54] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3940150 (10Tgr) @Nuria any thoughts about the next step? Should we just enable the fallback and check the data to see if it had any unexpe... [04:30:23] PROBLEM - HHVM rendering on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:13] PROBLEM - SSH cp3038.mgmt on cp3038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:17] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 79949 bytes in 0.294 second response time [05:00:43] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:01:03] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:07] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:36] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:43] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:43] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:02:44] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:02:53] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:03:13] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:03:23] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:03:23] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:03:54] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:04:23] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:05:13] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:06:26] puppetdb on nitrogen killed by oom --^ [05:06:29] [Fri Feb 2 04:57:52 2018] Killed process 20908 (java) total-vm:12292464kB, anon-rss:6269300kB, file-rss:0kB, shmem-rss:0kB [05:07:23] elukey: o/ [05:07:51] hello ema! <3 [05:11:13] elukey: anything to do when puppetdb gets shot, or does it recover on its own? [05:12:36] ema: systemd restarts it and the next puppet runs recovers automagically [05:13:32] ok! 
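The OOM-killer line pasted at 05:06:29 can be read mechanically when checking how much memory puppetdb's java process held before it was killed. The following is a minimal sketch, not tooling used on nitrogen: the function names `parse_oom_line` and `kb_to_gib` are hypothetical, and the regular expression assumes the kernel message has exactly the field layout shown in the pasted line.

```python
import re

# Kernel OOM-killer line as pasted in the channel (copied from the log above).
OOM_LINE = (
    "[Fri Feb 2 04:57:52 2018] Killed process 20908 (java) "
    "total-vm:12292464kB, anon-rss:6269300kB, file-rss:0kB, shmem-rss:0kB"
)

# Assumption: the message always carries these four counters in this order.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB, shmem-rss:(?P<shmem_rss>\d+)kB"
)


def parse_oom_line(line):
    """Return pid, command name and memory counters (in kB), or None if no match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Everything except the command name is a number.
    return {k: (v if k == "comm" else int(v)) for k, v in fields.items()}


def kb_to_gib(kb):
    """Convert kB (as reported by the kernel) to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    info = parse_oom_line(OOM_LINE)
    print(
        f"{info['comm']} (pid {info['pid']}): "
        f"anon-rss {kb_to_gib(info['anon_rss']):.2f} GiB, "
        f"total-vm {kb_to_gib(info['total_vm']):.2f} GiB"
    )
```

The anon-rss figure is the resident memory the process actually held when it was killed, so it is the more useful number to compare against the host's RAM and the JVM heap settings.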
[05:28:54] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [05:29:23] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:30:13] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:30:43] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:31:03] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:31:04] RECOVERY - SSH cp3038.mgmt on cp3038.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0) [05:32:03] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:32:33] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:32:43] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:32:43] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:32:44] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:32:53] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:33:14] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:33:23] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:33:23] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:37:17] !log truncate /var/log/aphlict/aphlict.log to 25G as temp measure to avoid phab1001's 
root partition to fill up [05:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:10] (03PS1) 10Elukey: phabricator: fix aphlict's logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/407583 [05:57:05] (03CR) 10Elukey: [C: 032] phabricator: fix aphlict's logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/407583 (owner: 10Elukey) [05:58:54] for some reason pcc didn't show any diff for --^ [06:10:35] (03PS2) 10Dzahn: DHCP: Add MAC address entry for tendril2001 [puppet] - 10https://gerrit.wikimedia.org/r/407457 (https://phabricator.wikimedia.org/T186123) (owner: 10Papaul) [06:13:16] (03CR) 10Dzahn: [C: 032] "racadm getsysinfo" [puppet] - 10https://gerrit.wikimedia.org/r/407457 (https://phabricator.wikimedia.org/T186123) (owner: 10Papaul) [06:13:23] (03PS3) 10Dzahn: DHCP: Add MAC address entry for tendril2001 [puppet] - 10https://gerrit.wikimedia.org/r/407457 (https://phabricator.wikimedia.org/T186123) (owner: 10Papaul) [06:16:56] (03CR) 10Dzahn: "thank you for this :) +20after4 +paladox" [puppet] - 10https://gerrit.wikimedia.org/r/407583 (owner: 10Elukey) [06:25:00] /away [06:27:43] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time [06:28:43] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.009 second response time [06:52:07] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3940279 (10elukey) I picked one of the httpd processes from server-status in... [06:53:44] (03CR) 10Elukey: "Can I just truncate the file to 1G or does it contain useful things? 
Now it is 25G :)" [puppet] - 10https://gerrit.wikimedia.org/r/407583 (owner: 10Elukey) [07:03:48] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807) [07:08:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:09:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:10:00] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407585 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:10:26] !log Fixing data drifts on db1065 - T162807 [07:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:41] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [07:11:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1065 - T162807 (duration: 00m 55s) [07:11:18] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940290 (10elukey) >>! In T186020#3939748, @Dzahn wrote: > @elukey so yea, now we'd have to restart the instance from ganeti, as the comment above says rebooting from within... 
[07:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807) [07:35:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:37:27] (03PS1) 10Elukey: README.md: add a note about glibc depencency [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) [07:38:03] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:38:13] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407586 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [07:39:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T162807 (duration: 00m 55s) [07:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:28] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [07:50:46] (03CR) 10Muehlenhoff: README.md: add a note about glibc depencency (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: 10Elukey) [07:54:29] (03CR) 10Elukey: README.md: add a note about glibc depencency (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: 10Elukey) [07:55:32] (03PS2) 10Elukey: README.md: add a note about glibc depencency [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) [07:59:39] (03CR) 10Muehlenhoff: [C: 
031] README.md: add a note about glibc depencency [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: 10Elukey) [08:10:51] (03PS1) 10Ema: cache_upload: upgrade cp4021 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407594 (https://phabricator.wikimedia.org/T180433) [08:12:37] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp4021 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407594 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [08:14:00] !log cache_upload: upgrade cp4021 to varnish 5 [08:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:15] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3940344 (10Paladox) @elukey thank you for looking into this :). Is our next s... [08:21:06] !log cache_upload: repool cp4021 (varnish 5) [08:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:01] !log installing curl security updates on trusty (Debian already updated) [08:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:13] !log Stop replication in sync db1089 - db1065 - T162807 [08:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:27] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [08:27:56] (03PS1) 10Ema: cache_upload: upgrade cp4022 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407598 (https://phabricator.wikimedia.org/T180433) [08:29:11] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp4022 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407598 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [08:29:49] !log cache_upload: upgrade cp4022 to varnish 5 [08:30:02] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:24] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3127: Connection refused [08:33:43] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3124: Connection refused [08:33:44] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 80: Connection refused [08:33:44] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3125: Connection refused [08:33:56] that's me, the host is depooled ^ [08:34:24] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.157 second response time [08:34:43] RECOVERY - Varnish HTTP upload-frontend - port 3124 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.157 second response time [08:34:44] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.160 second response time [08:34:44] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp4022 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.157 second response time [08:35:49] !log cache_upload: repool cp4022 (varnish 5) [08:35:58] (03PS1) 10Marostegui: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807) [08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:57] !log apt-get install php5-dbg on phab1001 as attempt to have a better gdb output for T182832 [08:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:12] T182832: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 [08:39:52] (03CR) 10Marostegui: [C: 
032] db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:41:05] nope, I'd probably need libapache2-mod-php5's dbg symbols too [08:41:55] mmm maybe not, only the ones for the json lib? [08:42:05] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:42:18] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407599 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [08:43:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T162807 (duration: 00m 55s) [08:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:01] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [08:44:42] (03PS1) 10Ema: cache_upload: upgrade cp4023 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407600 (https://phabricator.wikimedia.org/T180433) [08:52:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407601 [08:52:35] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407601 [08:58:37] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407601 (owner: 10Marostegui) [09:00:17] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407601 (owner: 10Marostegui) [09:00:30] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407601 (owner: 10Marostegui) [09:01:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T162807 (duration: 00m 54s) [09:02:03] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:03] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:06:24] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp4023 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407600 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [09:08:17] !log cache_upload: upgrade cp4023 to varnish 5 [09:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:26] !log cache_upload: repool cp4023 (varnish 5) [09:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940425 (10jcrespo) @Marostegui- this is bad, codfw machine was created as tendril2001 T186123, and this was called db1115. This is not a terrible name for... [09:18:49] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940433 (10Marostegui) >>! In T185788#3940425, @jcrespo wrote: > @Marostegui- this is bad, codfw machine was created as tendril2001 T186123, and this was ca... [09:21:34] (03PS1) 10Marostegui: install_server: Replace db1115 with tendril1001 [puppet] - 10https://gerrit.wikimedia.org/r/407605 (https://phabricator.wikimedia.org/T185788) [09:24:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup tendril1001 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940445 (10Marostegui) >>! In T185788#3940441, @jcrespo wrote: >> Suggestions? > > The easy thing would be call it tendril1001 (which I do not 100% li... 
[09:26:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup tendril1001 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940447 (10jcrespo) > To be honest, I would call it as a normal database name, to avoid making any kind of exception and having dedicated hostnames Wh... [09:27:46] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup tendril1001 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940448 (10Marostegui) >>! In T185788#3940447, @jcrespo wrote: >> To be honest, I would call it as a normal database name, to avoid making any kind of... [09:36:01] (03PS1) 10Gilles: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) [09:36:33] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [09:36:40] (03Abandoned) 10Marostegui: install_server: Replace db1115 with tendril1001 [puppet] - 10https://gerrit.wikimedia.org/r/407605 (https://phabricator.wikimedia.org/T185788) (owner: 10Marostegui) [09:37:11] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940476 (10Marostegui) [09:38:12] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#3940478 (10fgiunchedi) [09:41:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940489 (10jcrespo) Note for that one means involving papaul and renaming stuff, from the physical label to racktables, to dns, etc. 
[09:41:14] (03PS1) 10Marostegui: install_server: Change db.cfg with raid1-gpt.cfg [puppet] - 10https://gerrit.wikimedia.org/r/407610 (https://phabricator.wikimedia.org/T185788) [09:41:37] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940491 (10Marostegui) >>! In T185788#3940489, @jcrespo wrote: > Note for that one means involving papaul and renaming stuff, from the physical label to rac... [09:43:09] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940492 (10jcrespo) All db hosts will have a hw RAID except these, it will be confusing. [09:44:41] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940493 (10Marostegui) >>! In T185788#3940492, @jcrespo wrote: > All db hosts will have a hw RAID except these, it will be confusing. ok - I am going to st... 
[09:46:35] !log Add thumborUrl to Swift config in PrivateSettings.php [09:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:07] (03PS1) 10Gilles: Proxy public wiki thumb.php wikis through Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407611 (https://phabricator.wikimedia.org/T169144) [09:50:51] (03PS2) 10Gilles: Proxy public wiki thumb.php requests through Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407611 (https://phabricator.wikimedia.org/T169144) [09:56:34] (03PS2) 10Gilles: Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) [09:57:01] (03CR) 10jerkins-bot: [V: 04-1] Give officewiki read access to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [09:57:08] !log roll-upgrade thumbor to 1.11 - T178072 T185478 T185483 T185485 T183907 T179954 [09:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:26] T185483: Measure time spent by Thumbor connecting to and reading from Memcache - https://phabricator.wikimedia.org/T185483 [09:57:26] T185478: Measure time spent by Thumbor connecting to and reading from Swift - https://phabricator.wikimedia.org/T185478 [09:57:26] T185485: Measure time spent by Thumbor connecting to and reading from Poolcounter - https://phabricator.wikimedia.org/T185485 [09:57:26] T179954: Thumbor errors should contain a trackable request id - https://phabricator.wikimedia.org/T179954 [09:57:26] T183907: Thumbor 500 while thumbnailing some webm files - https://phabricator.wikimedia.org/T183907 [09:57:26] T178072: Thumbor: Error reading image metadata: Failed to read image data - https://phabricator.wikimedia.org/T178072 [09:57:33] thanks stashbot [09:58:42] (03CR) 10Gilles: "The linting error seems incorrect..." 
[puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [10:00:14] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3940555 (10Marostegui) 05Open>03stalled [10:18:28] !log installing ruby security updates on trusty [10:18:35] (03PS1) 10Ema: cache_upload: upgrade cp4024 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407616 (https://phabricator.wikimedia.org/T180433) [10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:28] (03CR) 10Ema: [C: 032] cache_upload: upgrade cp4024 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407616 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [10:20:03] !log cache_upload: upgrade cp4024 to varnish 5 [10:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:25] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] "This got a "Go!" from PM already, see T186107." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/407017 (https://phabricator.wikimedia.org/T186107) (owner: 10Zoranzoki21) [10:24:37] !log cache_upload: repool cp4024 (varnish 5) [10:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:50] (03PS2) 10Marostegui: install_server: Change db.cfg with raid1-gpt.cfg [puppet] - 10https://gerrit.wikimedia.org/r/407610 (https://phabricator.wikimedia.org/T185788) [10:35:31] (03PS1) 10Ema: cache_upload: upgrade cp4025 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407620 (https://phabricator.wikimedia.org/T180433) [10:35:33] (03PS1) 10Ema: cache_upload: upgrade ulsfo to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407621 (https://phabricator.wikimedia.org/T180433) [10:35:35] (03CR) 10Marostegui: [C: 032] install_server: Change db.cfg with raid1-gpt.cfg [puppet] - 10https://gerrit.wikimedia.org/r/407610 (https://phabricator.wikimedia.org/T185788) (owner: 10Marostegui) [10:37:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407622 (https://phabricator.wikimedia.org/T162807) [10:37:50] (03PS2) 10Ema: cache_upload: upgrade cp4025 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407620 (https://phabricator.wikimedia.org/T180433) [10:38:16] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade cp4025 to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407620 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [10:39:05] !log cache_upload: upgrade cp4025 to varnish 5 [10:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407622 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [10:40:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407622 
(https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [10:40:57] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407622 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [10:41:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 - T162807 (duration: 00m 55s) [10:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:09] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [10:43:46] !log cache_upload: repool cp4025 (varnish 5) [10:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:27] (03CR) 10Filippo Giunchedi: "> The linting error seems incorrect..." [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [11:06:31] (03PS2) 10Ema: cache_upload: upgrade ulsfo to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407621 (https://phabricator.wikimedia.org/T180433) [11:07:02] (03CR) 10Ema: [V: 032 C: 032] cache_upload: upgrade ulsfo to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/407621 (https://phabricator.wikimedia.org/T180433) (owner: 10Ema) [11:07:59] !log cache_upload: upgrade cp4026 to varnish 5 [11:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:28] !log cache_upload: repool cp4026 (varnish 5) [11:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:53] 10Operations, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review: php-luasandbox in Wikimedia's Stretch apt repo depends on php5 - https://phabricator.wikimedia.org/T183888#3940663 (10MoritzMuehlenhoff) I've uploaded a backport of Kunal's 1.5.1-3 package from Debian testing to stretc... 
[11:21:30] (03CR) 10Muehlenhoff: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [11:27:04] 10Operations, 10Packaging: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3940702 (10Legoktm) 05Open>03Resolved Both wikidiff2 and luasandbox are now in stretch-backports. [11:27:07] 10Operations, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review: php-luasandbox in Wikimedia's Stretch apt repo depends on php5 - https://phabricator.wikimedia.org/T183888#3940705 (10Legoktm) [11:30:07] 10Operations, 10Packaging: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3940712 (10MoritzMuehlenhoff) In addition I'll drop the php-wikidiff2 from our internal src:php-wikidiff2 package (so that it only builds hhvm-wikidiff2). [11:31:53] PROBLEM - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [11:32:50] ^ that is known I will ack it [11:32:58] and downtime it [11:33:39] ACKNOWLEDGEMENT - MegaRAID on db1051 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Marostegui T186049 - The acknowledgement expires at: 2018-02-09 11:33:09. 
[11:38:19] 10Operations, 10wikitech.wikimedia.org: Remove cloud-admin rights from YuviPanda - https://phabricator.wikimedia.org/T186289#3940719 (10MarcoAurelio) [11:39:58] !log uploaded php-wikidiff2 1.5.1+deb9u2 to apt.wikimedia.org (despite the source package name, this package only builds hhvm-wikidiff2 now as php-wikidiff2 is instead updated via stretch-backports, the old internal package will eventually be phased out when we move to PHP7) [11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:44] 10Operations, 10wikitech.wikimedia.org: Remove cloud-admin rights from YuviPanda - https://phabricator.wikimedia.org/T186289#3939839 (10MarcoAurelio) As for wikitech, any user from https://wikitech.wikimedia.org/wiki/Special:ListUsers?group=cloudadmin can do that. [11:41:53] RECOVERY - MegaRAID on db1051 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [11:42:07] how is that possible? [11:43:55] I did a re-learn [11:43:57] (03PS2) 10Filippo Giunchedi: Add Thumbor-Request-Id generated by nginx [puppet] - 10https://gerrit.wikimedia.org/r/407411 (https://phabricator.wikimedia.org/T179954) (owner: 10Gilles) [11:44:14] based on past experience, it may fail again soon :-/ [11:44:19] it will [11:46:31] (03CR) 10Joal: [C: 031] "Looks good to me, one question inline :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [11:47:07] (03CR) 10Filippo Giunchedi: [C: 032] Add Thumbor-Request-Id generated by nginx [puppet] - 10https://gerrit.wikimedia.org/r/407411 (https://phabricator.wikimedia.org/T179954) (owner: 10Gilles) [11:52:14] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940724 (10MoritzMuehlenhoff) mailman3-core, mailman3-hyperkitty, postorius and mailmanclient have been accepted into stretch-backports today. 
[11:57:24] !log roll-restart nginx on thumbor and swift-proxy on ms-fe to apply https://gerrit.wikimedia.org/r/407411 [11:57:31] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 [11:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:25] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940731 (10MarcoAurelio) Does that mean we can start considering our migration? [12:05:15] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 (owner: 10Marostegui) [12:05:31] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940735 (10MoritzMuehlenhoff) That's dependent on goal planning / road map considerations, I only meant to point out the availability in backports sinc... [12:06:56] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 (owner: 10Marostegui) [12:07:15] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407627 (owner: 10Marostegui) [12:08:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T162807 (duration: 00m 55s) [12:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:14] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [12:11:02] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3940741 (10Elitre) >>! 
In T52864#3940724, @MoritzMuehlenhoff wrote: > mailman3-core, mailman3-hyperkitty, postorius and mailmanclient have been acce... [12:15:06] 10Operations, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review: php-luasandbox in Wikimedia's Stretch apt repo depends on php5 - https://phabricator.wikimedia.org/T183888#3940749 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Our internal wikidiff2 package has... [12:31:02] (03CR) 10Muehlenhoff: [C: 04-1] Remove access for myself (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/407577 (owner: 10Yuvipanda) [12:34:20] (03CR) 10Muehlenhoff: [C: 04-1] "And BTW, you're also listed in modules/nagios_common/files/contactgroups.cfg; do you want to be kept in the sms contact group? And" [puppet] - 10https://gerrit.wikimedia.org/r/407577 (owner: 10Yuvipanda) [12:37:54] and now db1070 [12:55:43] PROBLEM - MegaRAID on db1070 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [12:55:44] ACKNOWLEDGEMENT - MegaRAID on db1070 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T186319 [12:55:48] 10Operations, 10ops-eqiad: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940831 (10ops-monitoring-bot) [12:56:28] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940835 (10Marostegui) This is s5 master [12:56:56] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940837 (10Marostegui) p:05Triage>03High [12:57:23] (03CR) 10Elukey: profile::analytics::refinery::job::camus: add netflow hourly job (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [12:57:42] !log installing updated kernels on remaining jessie DB servers [12:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:13] 10Operations, 10ops-eqiad, 
10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3940831 (10Marostegui) a:03Cmjohnson @Cmjohnson this host is out of warranty Can we replace its disk as soon as possible - if possible before the weekend comes? [13:08:22] (03PS7) 10Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) [13:09:50] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3940856 (10MoritzMuehlenhoff) >>! In T172487#3902655, @Dzahn wrote: > No, no point i debugging indeed. Instead it would be really nice if it could be shutdown after running such a long time doing nothing... [13:16:14] !log installing w3m security updates on trusty [13:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:08] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3940874 (10Paladox) Is it possible that phab carnt keep up with everyone conn... [13:20:00] (03PS8) 10Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) [13:44:33] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940931 (10elukey) Is it a simple gnt-instance reboot meitnerium.wikimedia.org right? 
[13:48:16] (03PS1) 10Marostegui: db1100: Swtiching it to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) [13:49:41] (03CR) 10Jcrespo: [C: 031] db1100: Swtiching it to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [13:49:49] (03PS1) 10Marostegui: db-eqiad.php: Clarifying that db1100 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) [13:50:44] (03CR) 10Jcrespo: [C: 031] "Will we need to depool it for a restart and/or hot reconfig?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [13:51:47] (03CR) 10Marostegui: "> Will we need to depool it for a restart and/or hot reconfig?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [13:53:49] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3940941 (10Dzahn) Yes, i can't do them though because i don't have the access to disable switch ports. 
[13:55:26] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940943 (10Dzahn) Yea, or gnt-instance shutdown gnt-instance startup [13:59:10] (03PS1) 10Muehlenhoff: Add library hint for librdkafka [puppet] - 10https://gerrit.wikimedia.org/r/407636 [13:59:48] !log reboot meitnerium via gnt-instance reboot on ganeti1005 to pick up new disk config - T184794 [14:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:00] T184794: Fix outstanding bugs preventing the use of prometheus jmx agent for Hive/Oozie - https://phabricator.wikimedia.org/T184794 [14:00:25] course the task is T186020 not this one [14:00:26] T186020: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020 [14:00:39] (03CR) 10Jcrespo: [C: 031] "There is a typo on the title- it should also be changed per style guide to "...:Switch it ..."" [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [14:00:44] (03CR) 10Dzahn: "decom steps says it needs to happen after switch ports are disabled and i can't disable switch ports, so i'll have to leave this to Rob to" [dns] - 10https://gerrit.wikimedia.org/r/407173 (owner: 10Papaul) [14:01:24] (03PS2) 10Dzahn: DNS: Add production DNS entry for tendril2001 [dns] - 10https://gerrit.wikimedia.org/r/407454 (owner: 10Papaul) [14:01:27] (03CR) 10Jcrespo: [C: 031] "...: Clarify db1100..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [14:03:16] jynus: is dbmonitor2001 not the same as tendril2001 ? 
[14:04:41] (03CR) 10Muehlenhoff: [C: 032] Add library hint for librdkafka [puppet] - 10https://gerrit.wikimedia.org/r/407636 (owner: 10Muehlenhoff) [14:05:46] mutante: one is the frontend, and the other is the backend [14:06:05] jynus: ah !:) [14:06:16] but there is no "tendril1001" [14:06:33] well, it is not clear right now if it will be [14:06:46] will that happen later when db1011 will be replaced? [14:06:55] what I think is we shouldn't have different names for the same thing on different dcs [14:07:16] we can have different names for backend and frontend, if we want [14:07:28] is tendril2001 the equivalent of db1011 and will use role(mariadb::tendril) ? [14:07:41] yes [14:07:44] ok, gotcha [14:08:00] i guess it means we should at some point rename that [14:08:20] the thing is we either rename tendril to db, or db to tendril, or both to something else [14:08:51] tendril1001/db1115 has not yet been set up, so better decide now [14:08:56] i can still just _not_ merge that change above that adds tendril2001 [14:08:56] tendril is not a good name [14:09:02] so far tendril is only a service name [14:09:16] because we may rename the tendril service to something else [14:09:30] that is why I called the frontends dbmonitor [14:09:33] more generic [14:09:57] dbmonitor-be ? [14:09:57] these should be dbmonitordb's, but that is a terrible name [14:10:12] -be is like we do for swift [14:10:30] marostegui may also be mad at me for confusing him [14:10:46] :'-( [14:11:22] renaming dbmonitors is also very easy because they are vms [14:12:08] what do the rest think, are dbmonitor-fe and dbmonitor-be clear? [14:12:41] dbmondb [14:12:47] lol [14:12:58] yes, -be and -fe follow the same pattern we use for others at least [14:13:20] however, we do not do that for other dbs [14:13:28] misc dbs are still dbs [14:13:35] (03CR) 10Dzahn: [C: 04-1] "talked with jynus. we should find a different name for these machines.
tendril may be replaced by other software and isn't a good host nam" [dns] - 10https://gerrit.wikimedia.org/r/407454 (owner: 10Papaul) [14:14:04] yea, the other option is of course just to give it a db number like all before [14:14:07] up to you guys [14:14:20] but it seems clear we should not use tendril2001 then [14:14:34] those db hosts may also be used to host a small, private installation of prometheus [14:14:42] for private queries [14:15:41] faidon had some issues with so many db* [14:16:05] maybe it would be a good time to use db* only for mediawiki metadata databases [14:16:16] and set up miscdb* hosts [14:16:52] although another thing is that we change those sometimes [14:17:10] while new tendril hosts have specific different hardware [14:20:14] PROBLEM - Host etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:14] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [14:20:54] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3934359 (10Dzahn) After talking with jcrespo on IRC: We should use a different name for this system. So far tendril is only a service name, not a host name. And tendril might be...
[14:21:03] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:03] PROBLEM - Host ununpentium is DOWN: PING CRITICAL - Packet loss = 100% [14:21:13] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:13] PROBLEM - Host kubestagetcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:23] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:54] I was copying data from /var/lib/archiva on meitnerium to another mount point, the host froze so I guess that this mess is due to me [14:22:13] PROBLEM - ganeti-noded running on ganeti1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded [14:22:46] I didn't really expect this issue [14:23:04] RECOVERY - ganeti-noded running on ganeti1005 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [14:23:38] it's the same error we'd been seeing before [14:23:52] https://phabricator.wikimedia.org/T181121 [14:24:07] moritzm: Thanks for https://phabricator.wikimedia.org/T52864#3940724 :-) That's interesting news. [14:24:34] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 644969 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:11] moritzm: ganeti1005 works fine, no idea how to restore the instances.. 
[14:25:27] mhh, this might be somewhat different, though, there's various OOM killer logs for qemu processes, don't think we've seen that before in T181121 [14:25:28] T181121: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121 [14:25:33] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 559363 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:34] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5671 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:25:43] RECOVERY - Host ununpentium is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [14:25:43] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [14:25:47] ah ok [14:25:53] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [14:25:53] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [14:25:55] yeah, they should recover soonish [14:26:02] so cpu and load went up when I started the cp [14:26:05] RECOVERY - Host kubestagetcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [14:26:05] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [14:26:13] RECOVERY - Host etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [14:26:13] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [14:26:23] I'll poke at logs and compare to what we collected at T181121 [14:26:33] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 3828 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:28:41] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940974 (10elukey) New disk in place, added ext4 and everything looks good. I mounted /dev/vdb1 to /mnt/archiva and started a cp -a from /var/lib/archiva to that dir, but ga... 
[14:29:14] I am going to stop for today working on that host :( [14:35:22] (03PS2) 10Arturo Borrero Gonzalez: WIP: apt: merge script report-pending-upgrades to apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [14:35:28] (03CR) 10jerkins-bot: [V: 04-1] WIP: apt: merge script report-pending-upgrades to apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [14:40:22] (03CR) 10Elukey: [C: 032] README.md: add a note about glibc depencency [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/407587 (https://phabricator.wikimedia.org/T186169) (owner: 10Elukey) [14:40:30] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3940981 (10akosiaris) How nice :(. But it does look like disk IO is a possible reproduction scenario for T181121. I 'll empty ganeti1005 to avoid having any worse problems d... [14:40:33] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 4683890 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:41:06] akosiaris: sorry for the trouble :( [14:41:33] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 3703 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:42:25] 10Operations, 10ops-eqiad: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121#3940985 (10akosiaris) During some disk IO in T186020, ganeti1005 exhibited the usual symptoms. This hasn't been triggered for ~1 month so maybe we have a reproduction scenario in some heavy IO ? An... [14:43:07] 10Operations, 10ops-eqiad: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121#3940987 (10MoritzMuehlenhoff) Happened again on ganeti1005, similar errors, but this time triggered by a copy of the Archiva data. 
[14:51:52] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is a git submodule (there are more, listed at https://phabricator.wikimedia.org/source/operations-puppet/browse/production/.gitmodule" [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) (owner: 10Herron) [14:53:04] !log reboot ganeti1005 after emptying it. T181121 [14:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:17] T181121: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121 [14:54:23] PROBLEM - Host ganeti1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:55:03] RECOVERY - Host ganeti1005 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:58:20] (03CR) 10Muehlenhoff: "But it would be great to simply fold nginx into the main operations puppet.git, having that as a separate sub module is probably entirely " [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) (owner: 10Herron) [15:02:52] (03CR) 10Alexandros Kosiaris: [C: 04-2] "FWIW, I think so too. 
We 've had this discussion multiple times and I think we agree that we haven't really seen any discernible advantage" [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) (owner: 10Herron) [15:06:45] (03PS1) 10Alexandros Kosiaris: admin: Add builder-docker group extending ops rights [puppet] - 10https://gerrit.wikimedia.org/r/407642 [15:09:23] (03PS2) 10BBlack: URL Path Normalization: refactor, add to cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407488 (https://phabricator.wikimedia.org/T127387) [15:09:25] (03PS2) 10BBlack: URL Path Normalization: add to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/407489 (https://phabricator.wikimedia.org/T127387) [15:09:27] (03PS1) 10BBlack: URL Path Normalization: fully normalize cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407643 (https://phabricator.wikimedia.org/T127387) [15:15:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3941041 (10elukey) >>! In T185409#3921916, @RobH wrote: > This is an older R720xd, and uses an older H710 controller. > > While @Cmjohnson can check for a spare when back onsite,... [15:25:25] !log ganeti1004 - stopped and started VM ununpentium [15:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:25] icinga-wm: talk [15:34:09] !log uploaded HHVM 3.18.5+dfsg+wmf5+icu57 to jessie-wikimedia/component/icu57 (HHVM 3.18.8 linked against an ICU 57 backport from stretch) [15:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Clarifying that db1100 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [15:38:32] mutante: can you advise me a bit more about the httpd module? 
I'm trying to enable libapache2-mod-wsgi-py3 and can't figure out 1) what the right place is to get the actual package installed and 2) how to enable it once it's installed. [15:39:14] (03PS2) 10Marostegui: db1100: Switch it to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) [15:40:02] (03Merged) 10jenkins-bot: db-eqiad.php: Clarifying that db1100 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [15:40:14] (03CR) 10jenkins-bot: db-eqiad.php: Clarifying that db1100 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407635 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [15:40:24] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3941061 (10Nuria) I see no problem with changset, my comment was about pointing out that "Flipping these Edges/Safaris to origin is going... [15:41:53] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3941063 (10Marostegui) As I have stated on T185788 my personal preference is to keep using db* on both, eqiad and codfw. I was also fine with tendrilXXXX. I honestly thing we a... 
[15:42:12] (03PS3) 10Marostegui: db1100: Switch it to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) [15:43:32] andrewbogott: the place to enable a module is in a role class and you pass the list of modules to it as parameters, so example: [15:43:40] class { '::httpd': [15:43:40] modules => ['alias', 'ssl', 'php5', 'rewrite', 'headers', 'wsgi', 'expires', 'lbmethod_byrequests', 'proxy', 'proxy_balancer', 'proxy_http'], [15:43:48] mutante: right, but it doesn't work :) [15:43:49] (03CR) 10Marostegui: [C: 032] db1100: Switch it to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/407633 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [15:43:53] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407650 (https://phabricator.wikimedia.org/T18632) [15:44:10] that's new to me :/ [15:44:20] Error: /Stage[main]/Httpd/Httpd::Mod_conf[wsgi-py3]/Exec[ensure_present_mod_wsgi-py3]/returns: change from notrun to 0 failed: /usr/sbin/a2enmod wsgi-py3 returned 1 instead of one of [0] [15:44:27] I've tried with both wsgi-py3 and just 'wsgi' as a control [15:44:42] well, and also php5 [15:44:56] on which host is it? [15:45:05] I also don't see any code attached to that module that would actually install the package that provides the module [15:45:39] abogott-horizonsourcedeploy.testlabs.eqiad.wmflabs [15:46:18] the puppet profile of interest is on abogott-puppetmaster.testlabs in /var/lib/git/operations/puppet/modules/profile/manifests/openstack/base/horizon/dashboard_source_deploy.pp [15:46:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407650 (https://phabricator.wikimedia.org/T18632) (owner: 10Marostegui) [15:47:29] give me a minute.. 
looking [15:47:34] thanks [15:47:43] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3941078 (10Nuria) Safari sessions still appear "shorter": {F12961703} Compare to chrome ones: {F12961706} So Safari is sending us no da... [15:48:09] as best I can tell httpd just assumes that all the mods that you might need are already installed. Not sure where they come from [15:48:25] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407650 (https://phabricator.wikimedia.org/T18632) (owner: 10Marostegui) [15:48:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407650 (https://phabricator.wikimedia.org/T18632) (owner: 10Marostegui) [15:49:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 - T186321 (duration: 00m 55s) [15:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:56] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [15:50:03] !log Restart MySQL on db1100 - T186321 [15:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3941091 (10Papaul) Thanks to all for the name discussion, but so far no decision has been made yet if we are keeping the same name or changing the name. Please confirm if we ar... [15:54:46] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3941112 (10jcrespo) The things is people like @faidon expressed that our current schema names was confusing for him, and I can see a reason why. We can run, even with difficulty,... 
[15:56:47] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407652 [15:58:42] andrewbogott: fixed! 2 things: first, you are right that it doesn't auto-install it, so i added require_package('libapache2-mod-wsgi-py3') in the profile (though in prod the httpd class declaration would be in the role, to avoid a duplicate definition if there is more than one service on a node), and second, the package name isn't the same as the module name. [15:58:48] libapache2-mod-wsgi-py3 provides only "mod_wsgi" but not "mod_wsgi_py3" or so [15:59:01] dpkg -L libapache2-mod-wsgi-py3 [15:59:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407652 (owner: 10Marostegui) [15:59:55] ah, so it's called 'wsgi' regardless of whether it's python3 or python2 [16:00:31] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-users / webrequest for Esteban - https://phabricator.wikimedia.org/T185988#3941132 (10Esteban) Hello Dzahn, Thanks for your answer, here are mine: I have not yet worked with anyone from WMF and here is the reason why I request access to we...
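The two-part fix discussed above (install the package explicitly, then enable the module under the name the package actually provides) can be sketched as a Puppet fragment. This is a minimal sketch only: the profile class name is hypothetical, the module list is cut down to the single entry under discussion, and it assumes the `::httpd` class and the `require_package` function behave as described in the conversation. It does not attempt to settle the fresh-install ordering concern.

```puppet
# Hypothetical profile name, used for illustration only.
class profile::example::wsgi_py3_httpd {
    # Per the discussion, the httpd class does not install module packages,
    # so the Debian package providing the module is pulled in explicitly.
    require_package('libapache2-mod-wsgi-py3')

    # The package ships the module as "wsgi" for python2 and python3 alike,
    # so "wsgi" is the name to enable; "wsgi-py3" makes a2enmod fail.
    class { '::httpd':
        modules => ['wsgi'],
    }
}
```

Running `dpkg -L libapache2-mod-wsgi-py3` on the host, as suggested in the channel, is one way to confirm which module name the package provides before choosing the entry for `modules`.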
[16:00:38] I feel like that means that on a fresh install this will go poorly… but I will try [16:00:40] Thank you for looking [16:00:47] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407652 (owner: 10Marostegui) [16:00:48] andrewbogott: you can see my change with "git diff" , i didnt commit [16:00:59] well, local puppetmaster [16:01:01] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407652 (owner: 10Marostegui) [16:02:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1100 - T186321 (duration: 00m 54s) [16:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:12] T186321: Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [16:09:32] 10Operations, 10fundraising-tech-ops, 10monitoring: ssl monitoring: add civicrm.wikimedia.org to icinga - https://phabricator.wikimedia.org/T186328#3941156 (10RobH) p:05Triage>03High [16:11:02] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407660 [16:12:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407660 (owner: 10Marostegui) [16:14:22] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407660 (owner: 10Marostegui) [16:15:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1100 (duration: 00m 54s) [16:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:11] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407660 (owner: 10Marostegui) [16:17:32] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on 
db1070 - https://phabricator.wikimedia.org/T186319#3941192 (10Marostegui) Thanks @Cmjohnson for replacing this disk so fast! ``` root@db1070:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -a0 Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 17% in... [16:18:38] (03PS3) 10Arturo Borrero Gonzalez: WIP: apt: merge report-pending-upgrades script into apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [16:21:31] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407663 [16:27:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407663 (owner: 10Marostegui) [16:27:40] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/407433 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [16:32:33] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407663 (owner: 10Marostegui) [16:32:47] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407663 (owner: 10Marostegui) [16:33:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1100 (duration: 00m 55s) [16:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:12] (03PS1) 10Marostegui: db-eqiad.php: Restore original weight for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407666 [16:42:47] (03PS3) 10BBlack: URL Path Normalization: refactor, add to cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407488 (https://phabricator.wikimedia.org/T127387) [16:42:49] (03PS2) 10BBlack: URL Path Normalization: fully normalize cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407643 (https://phabricator.wikimedia.org/T127387) [16:42:51] (03PS3) 10BBlack: URL Path 
Normalization: add to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/407489 (https://phabricator.wikimedia.org/T127387) [16:42:53] (03PS1) 10BBlack: URL Normalization: strip fragment [puppet] - 10https://gerrit.wikimedia.org/r/407670 (https://phabricator.wikimedia.org/T127387) [16:42:57] (03PS1) 10BBlack: URL Normalization: normalize query chars as well [puppet] - 10https://gerrit.wikimedia.org/r/407671 [16:42:59] (03PS4) 10Arturo Borrero Gonzalez: apt: merge report-pending-upgrades script into apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [16:43:15] chasemp: ok, no longer WIP ^^^ [16:45:00] 10Operations, 10ops-eqiad: check eventlog1002 production network cable - https://phabricator.wikimedia.org/T186252#3941235 (10Cmjohnson) I verified the cable, checked and noticed NIC1 PXE was disabled. Enabled and disabled NIC3 from PXE, rebooted the server. I see the attempt to get an image but no offers rec... [16:46:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407666 (owner: 10Marostegui) [16:47:55] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407666 (owner: 10Marostegui) [16:48:06] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407666 (owner: 10Marostegui) [16:49:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original traffic for db1100 (duration: 00m 54s) [16:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:31] 10Operations, 10ops-eqiad: check eventlog1002 production network cable - https://phabricator.wikimedia.org/T186252#3938674 (10elukey) Is this correct? 
``` host eventlog1002 { hardware ethernet 14:18:77:5B:0D:42; fixed-address eventlog1001.eqiad.wmnet; } ``` [16:52:42] (03PS1) 10Elukey: Fix eventlog1002's dhcp configuration [puppet] - 10https://gerrit.wikimedia.org/r/407675 (https://phabricator.wikimedia.org/T186252) [16:53:51] cmjohnson1: --^ [16:55:59] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Zayo Circuit down. Case: TTN-0002180438 [17:00:58] (03PS1) 10RobH: fixing eventlog1002 entry [dns] - 10https://gerrit.wikimedia.org/r/407677 (https://phabricator.wikimedia.org/T185667) [17:01:39] (03CR) 10RobH: [C: 032] Fix eventlog1002's dhcp configuration [puppet] - 10https://gerrit.wikimedia.org/r/407675 (https://phabricator.wikimedia.org/T186252) (owner: 10Elukey) [17:01:53] elukey: ^ thx i just found that and saw you made a patch already! [17:01:58] i was having install issues and that is why! [17:02:07] going to merge then! :) [17:02:10] well ayep [17:02:15] =] [17:02:37] ah nice! :) [17:02:40] that and there was an ipv6 typo [17:02:45] and an issue with the network cable [17:02:48] so multiple issues [17:02:53] but should all be fixed now =] [17:02:58] super [17:02:59] (shortly) [17:03:19] (03CR) 10RobH: [C: 032] fixing eventlog1002 entry [dns] - 10https://gerrit.wikimedia.org/r/407677 (https://phabricator.wikimedia.org/T185667) (owner: 10RobH) [17:04:33] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#3941308 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your request... [17:09:09] 10Operations, 10ops-eqiad: Hardware check on mw1271 - https://phabricator.wikimedia.org/T184722#3941309 (10Cmjohnson) @MoritzMuehlenhoff The error has not returned. 
Please feel free to re-pool [17:09:24] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T186319#3941310 (10Marostegui) 05Open>03Resolved ``` root@db1070:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : P... [17:11:57] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#3941319 (10Papaul) a:05Papaul>03RobH @Robh please see below for instruction on how to fix this problem. We need to run the file within the OS. so download the file, copy it somewhere on the server and run i... [17:13:27] 10Operations, 10ops-eqiad: check americium eth1 cabling and link - https://phabricator.wikimedia.org/T185219#3909801 (10Cmjohnson) I confirmed that the cable is plugged into port1 (labeled on the server) and also plugged into fasw-c1a 1/0/10. i see green link lights on both the port and server [17:15:45] RECOVERY - MegaRAID on db1070 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [17:17:23] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#3941347 (10RobH) a:05RobH>03Papaul Papaul, I'm not entirely certain what has happened with this system. Can you please clarify the troubleshooting that has taken place? Has the mainboard been replaced, o... [17:19:55] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3941353 (10RobH) a:03RobH [17:21:43] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3499956 (10RobH) so this host ssh is down, so i cannot disable puppet on the host. I'll do the remainder of the uninterruptible steps now. 
[17:23:28] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3941356 (10RobH) [17:25:05] (03PS1) 10RobH: decom iridium [dns] - 10https://gerrit.wikimedia.org/r/407683 (https://phabricator.wikimedia.org/T172487) [17:26:34] (03PS1) 10RobH: decom iridium [puppet] - 10https://gerrit.wikimedia.org/r/407684 (https://phabricator.wikimedia.org/T172487) [17:26:50] (03CR) 10RobH: [C: 032] decom iridium [dns] - 10https://gerrit.wikimedia.org/r/407683 (https://phabricator.wikimedia.org/T172487) (owner: 10RobH) [17:27:24] (03PS1) 10Jforrester: MWWikiversions::readDbListFile: Don't throw if the dblist doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407685 [17:27:26] (03PS1) 10Jforrester: MWWikiversions::writeWikiVersionsFile: No need to support PHP 5.3 any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407686 [17:27:34] (03CR) 10RobH: [C: 032] decom iridium [puppet] - 10https://gerrit.wikimedia.org/r/407684 (https://phabricator.wikimedia.org/T172487) (owner: 10RobH) [17:28:27] (03CR) 10Jforrester: [C: 04-2] "POC at this point." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407685 (owner: 10Jforrester) [17:28:29] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3941368 (10RobH) [17:28:43] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3499956 (10RobH) a:05RobH>03Cmjohnson Ok, this is now ready for onsite wipe. 
[17:29:48] 10Operations, 10Analytics, 10Patch-For-Review: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3941375 (10Cmjohnson) [17:29:50] 10Operations, 10ops-eqiad: apply hostname labels to eventlog1001/WMF4751 - https://phabricator.wikimedia.org/T185668#3941373 (10Cmjohnson) 05Open>03Resolved done [17:30:13] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Patch-For-Review, 10Services (watching): rename cassandra cluster - https://phabricator.wikimedia.org/T112257#3941376 (10Eevans) 05stalled>03Resolved a:03Eevans All clusters have a unique name (have for some time); Closing this ticket. [17:30:24] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3941379 (10RobH) [17:32:02] see you on belgium [17:32:51] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3941388 (10RobH) [17:32:53] 10Operations, 10ops-eqiad, 10Patch-For-Review: check eventlog1002 production network cable - https://phabricator.wikimedia.org/T186252#3941386 (10RobH) 05Open>03Resolved Chris got this working earlier today, resolving. [17:33:09] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3941390 (10Cmjohnson) @marostegui Let's do this Tuesday (my morning) 1500UTC [17:33:45] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3941392 (10Cmjohnson) Tuesday 6 Feb [17:34:15] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922494 (10RobH) a:05RobH>03Ottomata So due to both Faidon and Mortiz's comments, I've gone ahead and installed with stretch. If it needs to be re-imaged to fall back to an older distro, then dhcp... 
[17:34:24] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3941396 (10RobH) [17:34:49] 10Operations, 10ops-eqiad, 10Analytics-Kanban: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3941409 (10Cmjohnson) Can this be done around 1500UTC 6 Feb? I will be swapping out another bbu at the same time. [17:35:04] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3941410 (10elukey) >>! In T185667#3936830, @faidon wrote: > I had a look at both `modules/eventlogging/files/eventloggingctl` and `modules/eventlogging/templates/upstart/*`. They all seemed fairly easy... [17:36:34] 10Operations, 10ops-eqiad, 10Analytics-Kanban: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3941413 (10elukey) >>! In T185409#3941409, @Cmjohnson wrote: > Can this be done around 1500UTC 6 Feb? I will be swapping out another bbu at the same time. Fine to me! We have a b... [17:48:51] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3941433 (10MoritzMuehlenhoff) That host has a broken sshd config (coming from Phabricator), but it's possible to login via mgmt and the root password. 
[17:50:38] (03PS36) 10Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 [17:53:24] (03PS6) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [17:53:52] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:54:30] (03PS2) 10Ayounsi: Postgres: remove hardcoded version [puppet] - 10https://gerrit.wikimedia.org/r/404516 (https://phabricator.wikimedia.org/T184634) [17:55:28] (03CR) 10Ayounsi: [C: 032] Postgres: remove hardcoded version [puppet] - 10https://gerrit.wikimedia.org/r/404516 (https://phabricator.wikimedia.org/T184634) (owner: 10Ayounsi) [17:56:20] (03Abandoned) 10Herron: remove empty directory modules/nginx [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) (owner: 10Herron) [17:57:15] 10Operations, 10Puppet, 10Patch-For-Review: Puppet: Empty modules/nginx directory in operations/puppet - https://phabricator.wikimedia.org/T186268#3941445 (10herron) 05Open>03declined [17:57:20] 10Operations, 10Puppet: Puppet: Empty modules/nginx directory in operations/puppet - https://phabricator.wikimedia.org/T186268#3939224 (10herron) [17:58:03] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: Cleanup multiple definitions of logstash endpoint in puppet / hiera - https://phabricator.wikimedia.org/T182304#3941447 (10debt) 05Open>03Resolved [17:59:20] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3941461 (10Marostegui) Great! Will have the server ready by then Thanks! 
[18:00:07] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3941468 (10Cmjohnson) a case has been opened with HPE Your case was successfully submitted. Please note your Case ID: 5326748362 for future reference. [18:01:17] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3941490 (10Cmjohnson) [18:01:19] 10Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#3941488 (10Cmjohnson) 05Open>03Resolved Server is decom'd [18:04:47] (03PS2) 10Yuvipanda: Remove access for myself [puppet] - 10https://gerrit.wikimedia.org/r/407577 [18:05:47] (03CR) 10Yuvipanda: "Done!" [puppet] - 10https://gerrit.wikimedia.org/r/407577 (owner: 10Yuvipanda) [18:05:58] (03CR) 10Yuvipanda: "@legoktm :( indeed." [puppet] - 10https://gerrit.wikimedia.org/r/407577 (owner: 10Yuvipanda) [18:08:04] awwww [18:14:18] (03PS1) 10BBlack: Browser connection security warnings, again [puppet] - 10https://gerrit.wikimedia.org/r/407701 [18:14:56] (03PS1) 10Subramanya Sastry: Enable RemexHtml on fiwiki, hewiki, ruwiki, svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407702 (https://phabricator.wikimedia.org/T185945) [18:25:23] (03PS1) 10Subramanya Sastry: Enable RemexHtml on wikis with < 10 errors in all high-priority categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407706 (https://phabricator.wikimedia.org/T184656) [18:29:15] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3941594 (10RobH) >>! In T172487#3941433, @MoritzMuehlenhoff wrote: > That host has a broken sshd config (coming from Phabricator), but it's possible to login via mgmt and the root password. Done! power... 
[18:29:30] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3941595 (10RobH) [18:32:23] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on wikis with < 10 errors in all high-priority categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407706 (https://phabricator.wikimedia.org/T184656) (owner: 10Subramanya Sastry) [18:32:27] (03CR) 10Jforrester: [C: 031] Enable RemexHtml on fiwiki, hewiki, ruwiki, svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407702 (https://phabricator.wikimedia.org/T185945) (owner: 10Subramanya Sastry) [19:01:47] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T186344#3941678 (10Samtar) [19:03:16] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T186344#3941664 (10Samtar) cc @RobH who previously dealt with T174316 [19:04:37] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T186344#3941696 (10RobH) So if your NDA with WMF legal has expired, you'll have to get a new one signed and on file with them for this request. Has that already been done? (If not, this will have to be stalled un... [19:08:57] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T186344#3941701 (10Samtar) @RobH per T157483#3026051 "..NDA doesn't actually have an expiry date", I'm not //entirely sure// it has expired (I just think my access to the stats server has). Happy however to re-sign... [19:11:28] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T186344#3941703 (10RobH) @samtar: You are indeed correct, as far as I can tell your NDA has no expiry set (I have access to the WMF NDA google sheet that lists everyone who has one on file and any expiry.) I think... 
[19:13:45] 10Operations, 10LDAP-Access-Requests: NDA request for Samtar - https://phabricator.wikimedia.org/T186344#3941704 (10Samtar) No rush by definition of the word :-) [19:14:50] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-General-or-Unknown: Beta English Wikipedia: History of the page 'Bird' generates a 500 or 503 error - https://phabricator.wikimedia.org/T185969#3941717 (10Paladox) [19:19:58] 10Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3941723 (10Legoktm) I think we need to lobby/convince/remind @faidon and other roadmap deciders to allocate resources for this :) [19:20:39] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/407447 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [19:21:47] 10Operations, 10Packaging: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3941725 (10ArielGlenn) I've verified that the beta snapshot instance picks up this package with no tweaks to repos or pinning needed. Thanks! 
[19:22:08] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3941726 (10Paladox) See https://wiki.apache.org/httpd/php [19:44:46] (03PS9) 10Paladox: Gerrit: Switch to the mariadb connector [puppet] - 10https://gerrit.wikimedia.org/r/384588 (https://phabricator.wikimedia.org/T176164) [19:45:36] (03PS1) 10Jforrester: Tidy: Re-do this as a sorted negative list that gets shorter over time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407727 [19:47:41] (03PS6) 10Paladox: Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) [19:48:13] (03PS5) 10Paladox: Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 [20:09:33] (03PS3) 10Krinkle: Proxy public wiki thumb.php requests through Thumbor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407611 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [20:09:53] (03CR) 10Krinkle: [C: 031] "tabs>spaces fix, + line split parens for minor clarity" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407611 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [20:19:06] (03PS3) 10Paladox: vagrant::mediawiki: Create /srv/mediawiki-vagrant/.vagrant/machines [puppet] - 10https://gerrit.wikimedia.org/r/406484 (https://phabricator.wikimedia.org/T180377) [20:24:16] (03CR) 10Legoktm: [C: 04-2] "As I said on ops@, I think this is conceptually a bad idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407685 (owner: 10Jforrester) [20:25:50] The fix should have been on not allowing the deploy to progress, after detecting high error rate^ [20:27:06] mutante: o/ would Tuesday morning be a good time to deploy the research site? 
[20:27:28] jynus: That was because the baseline which we measured error rates used the higher error rate (from the failure). There is a fix for this (in flight? deployed yet? cc thcipriani). [20:27:42] Also that scap2 doesn't have rollback functionality (it never has) [20:28:05] oh, so there is a new verstion to deploy mediawiki in preparation? [20:28:10] But I'm working on a patch to better warn users when they should *probably* roll back [20:28:34] scap2 is going to start behaving more like scap3, eventually [20:28:38] It's a long tail of behaviors [20:28:59] in flight https://gerrit.wikimedia.org/r/#/c/403574/ [20:29:02] I am no sure I understand, but please do not waste time explaining it to me [20:29:11] I will research on my own [20:31:03] deploy try #1: is stopped because it raises the 10 minute average on the canaries, GOOD, deploy try #2: is not stopped because the 10 minute average on the canaries is elevanted due to try #1. (I forget what the time is, maybe 1 hour average?) [20:31:07] tl;dr & [20:31:14] s/&/^/ [20:31:29] ah, I get it [20:31:42] (03CR) 10Subramanya Sastry: [C: 031] "This is a useful way of organizing this. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407727 (owner: 10Jforrester) [20:31:43] thcipriani: Another easy-ish fix we could do is change the InitialiseSettings code to just load all *.dblist files at once, so you don't have to worry about syncing that to make a new *.dblist show up [20:31:45] It just will [20:32:40] Does order matter? [20:32:50] If not, then just list directory [20:33:46] We're only doing this when we hit this code path when $globals cache isn't valid so the stat calls are limited. 
[20:34:51] BTW, unrelated- I can support the database config thingy [20:35:26] newer mysql version even have good json support for coplex configs [20:35:52] the main issue is caching on database error (keep using the last config) [20:36:31] and maybe transactionality- but that is not solved right now either [20:36:54] Well the MW bits were written in a way that the backend can be swapped out [20:37:09] So whether it's mysql, something like zookeeper or etcd....any could work in theory :) [20:37:25] Good to know you think that mysql is viable here [20:37:26] all of those have the same problems [20:37:32] PHP behaviour on bad state [20:37:39] it is a similar case [20:37:39] Oh yeah totally [20:39:19] we are trying to solve it for etcd for dynamic config [20:39:19] well, tim is [20:39:21] probably mysql would be ok for not-dynamic config, such as namespace configuration [20:39:24] which has lots and lots of lines on the config right now [20:40:06] Actually, what would be cool is allowing multiple sources for config--like allowing some puppet stuff so we didn't have to duplicate things would be kinda cool [20:40:20] (eg: you rename a mailserver, and because of the hiera values MW gets the update instantly) [20:40:39] puppet is not the best thing for deployments [20:40:58] (03CR) 10Thcipriani: [C: 031] "checked using this config in beta from deployment-tin using:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) (owner: 10Chad) [20:41:10] it is slow over multiple hosts, and we are actually trying to not use it for dynamic stuff [20:41:15] 10Operations, 10Analytics, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3941978 (10Tgr) @Nuria the changeset did not change Wikimedia referrer policy, just made MediaWiki able to do so. What I would like is to... 
[20:41:28] it is ok for things that can wait 30 minutes to be deployed [20:41:56] Yeah, it's eventually-consistent [20:42:05] very eventually [20:42:05] :-D [20:42:05] Anyway, these are all dreams of mine for years now [20:42:35] (03CR) 10Chad: [C: 032] "Pretty sure I had gotten rid of this forever ago." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407686 (owner: 10Jforrester) [20:42:52] (03Draft1) 10Paladox: gerrit: Switch its-phabricator configs from velocity to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/407753 (https://phabricator.wikimedia.org/T140366) [20:42:57] (03PS2) 10Paladox: gerrit: Switch its-phabricator configs from velocity to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/407753 (https://phabricator.wikimedia.org/T140366) [20:43:16] (03PS3) 10Paladox: gerrit: Switch its-phabricator configs from vm to closure template [puppet] - 10https://gerrit.wikimedia.org/r/407753 (https://phabricator.wikimedia.org/T140366) [20:44:01] I think a better pre-production will help too, and the posibility of configurable percentage-deployment [20:44:19] jynus: we think alike :) [20:44:55] (03CR) 10Chad: [C: 032] Expose a simple Swagger spec for checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) (owner: 10Chad) [20:45:11] OK, let's see how the swagger spec works in prod :) [20:45:24] jynus: Moar deployment checking in that one ^ ;-) [20:47:40] !log truncated /var/log/aphlict/aphlict.log to 1G (was 26G) to avoid overhead for the upcoming first logrotate [20:47:51] (on phab1001, amending..) 
[20:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [20:50:38] (03CR) 10Chad: [V: 032 C: 032] Gerrit 2.14.6 [software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: 10Chad) [20:52:24] (03Merged) 10jenkins-bot: Expose a simple Swagger spec for checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) (owner: 10Chad) [20:53:02] (03CR) 10jenkins-bot: Expose a simple Swagger spec for checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) (owner: 10Chad) [20:54:02] !log demon@tin Synchronized docroot/wikipedia.org/spec.yaml: expose swagger spec (duration: 00m 56s) [20:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:33] thcipriani: Bam :) https://en.wikipedia.org/spec.yaml [20:55:20] beautiful! [20:56:16] no_justification: https://github.com/wikimedia/mediawiki-extensions-CommunityApplications <-- pls delete? [20:56:41] {{done}} [20:57:24] :D [20:57:29] another task I can now close [20:58:55] (03CR) 10C. Scott Ananian: [C: 031] Tidy: Re-do this as a sorted negative list that gets shorter over time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407727 (owner: 10Jforrester) [21:00:04] no_justification, mutante, and paladox: #bothumor I � Unicode. All rise for Gerrit 2.14.6 deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180202T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:01:40] demon loves upgrading gerrit :P [21:01:54] * paladox is here :) [21:02:19] I guess gerrit is going to be down for a while right? [21:02:26] in that case I can delay some patches [21:02:38] Loves? Not the right word. 
[21:02:50] sarcasm bro [21:03:03] :-) [21:03:07] What is it? How does it work? Can you eat it? [21:03:20] But not bringing it down this second, got more prep work to finish [21:03:57] (03CR) 10Herron: [C: 032] add forward/reverse dns records for new debian stretch puppetdb VMs [dns] - 10https://gerrit.wikimedia.org/r/407768 (https://phabricator.wikimedia.org/T185499) (owner: 10Herron) [21:05:04] Zuul backlog kinda long right now, I'd rather not kill things while so many changes waiting on results [21:07:58] it's gonna take like an hour to kill the queue [21:08:08] s/kill/clear/ [21:11:14] no_justification: I think people are going to keep reviewing / uploading code while Gerrit is up :) [21:11:24] (I also can't wait for the new version) [21:17:55] legoktm: https://gerrit.wikimedia.org/r/#/c/406998/ ? :) [21:24:50] 10Operations, 10Puppet: Extend puppetmaster::puppetdb to support puppetlabs packaged puppetdb 4.4 - https://phabricator.wikimedia.org/T185500#3942137 (10herron) [21:26:59] 10Operations, 10Puppet: Extend puppetmaster::puppetdb to support puppetlabs packaged puppetdb 4.4 - https://phabricator.wikimedia.org/T185500#3942146 (10herron) https://gerrit.wikimedia.org/r/#/c/407492/ [21:27:13] 10Operations, 10Puppet, 10Patch-For-Review: Extend puppetmaster::puppetdb to support puppetlabs packaged puppetdb 4.4 - https://phabricator.wikimedia.org/T185500#3942147 (10herron) [21:27:27] (03PS2) 10Herron: puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) [21:27:52] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) [21:28:07] (03PS3) 10Zoranzoki21: puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) 
[21:28:21] (03PS7) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [21:28:32] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) [21:28:50] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:29:38] (03CR) 10Zoranzoki21: [C: 04-1] "21:28:30 modules/puppetdb/manifests/app.pp:26 WARNING top-scope variable being used without an explicit namespace (variable_scope)" [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) [21:33:47] PROBLEM - SSH cp1074.mgmt on cp1074.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:34:12] (03CR) 10Herron: [C: 04-2] "Not to be merged before 405808" [puppet] - 10https://gerrit.wikimedia.org/r/407492 (https://phabricator.wikimedia.org/T185500) (owner: 10Herron) [21:41:01] no_justification: here [21:41:59] !log bringing down gerrit for upgrade [21:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:13] :D [21:42:13] !log cobalt: disabling puppet so it doesn't restart gerrit [21:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:37] Derp. 
Shoulda pulled code to tin first bahah [21:43:45] Before I killed the service [21:43:52] heh [21:45:05] !log demon@tin Started deploy [gerrit/gerrit@98f5d9a]: Gerrit 2.14.6 [21:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:19] !log demon@tin Finished deploy [gerrit/gerrit@98f5d9a]: Gerrit 2.14.6 (duration: 00m 14s) [21:45:23] huh, it's back up :O [21:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:33] It'll be up and down for awhile [21:45:37] ah, ok [21:46:04] 10Operations, 10fundraising-tech-ops, 10monitoring: ssl monitoring: add civicrm.wikimedia.org to icinga - https://phabricator.wikimedia.org/T186328#3942204 (10RobH) a:05RobH>03Jgreen [21:46:35] Guys I broke Gerrit! [21:47:00] Getting 503s. And "Gerrit is down. We're working on bringing it back as soon as possible. Please follow along the discussion at #wikimedia-operations on freenode as we debug. Please try again later!" [21:47:20] Matthew_ hi, there's a gerrit upgrade in progress [21:47:25] Looks like no_justification (Chad) is deploying some changes to Gerrit and restarting. [21:47:46] Where do I track those changes? [21:47:51] paladox: which mailing list? [21:48:00] wikitech-l [21:48:12] Oh those emails have been blank for months. [21:48:14] ^^ [21:48:16] I don't bother with them anymore. [21:48:18] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused [21:48:27] blank? [21:48:32] Yep. [21:48:38] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:48:48] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [21:49:49] is the first one ^ expected? 
[21:49:57] Sagan yep [21:50:01] legoktm: Screenshots incoming [21:50:03] Sagan it's gerrit.service [21:50:27] I don't need screenshots. [21:50:39] Things are going as expected. [21:50:41] Screenshots for blank emails? [21:50:45] Oh, meh [21:50:48] * no_justification ignores [21:50:56] That's what I'm sending screenshots of. [21:51:30] Matthew_, yes please. It should look like this, and is delivered in plaintext so shouldn't have any problems. https://lists.wikimedia.org/pipermail/wikitech-l/2018-January/089463.html [21:51:45] Message issues https://usercontent.irccloud-cdn.com/file/784ifXwL/Screen%20Shot%202018-02-02%20at%2014.49.44.png https://usercontent.irccloud-cdn.com/file/ljiHpRsS/Screen%20Shot%202018-02-02%20at%2014.49.49.png [21:51:56] quiddity: This isn't the first time this has happened. [21:52:38] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [21:52:48] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [21:53:01] I've filed three mailing-list related bugs over the past year and all of them have been ignored or i've been told "Apple's fault" to which Apple has said "not our fault" so eh. [21:54:17] PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [21:54:21] 10Operations, 10fundraising-tech-ops, 10monitoring: ssl monitoring: add civicrm.wikimedia.org to icinga - https://phabricator.wikimedia.org/T186328#3942207 (10RobH) it was there and someone else (not me) acked it. not a good idea since im the one generating and purchasing certs ;] [21:54:26] Matthew_: which is you MUA? [21:54:40] Matthew_, Ok. I'll dig and let you know. 
I'll followup in #wikimedia-tech or PM (so as not to distract the fine folks here with tangential issues. :) [21:54:40] i kind of didnt ACK those on purpose [21:54:47] because then we see when things come back [21:54:48] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [21:54:53] re: gerrit [21:54:57] quiddity: Thankee. [21:55:00] but it [21:55:03] Platonides: MUA? [21:55:05] it's scheduled maintenance [21:55:13] Matthew_: Mail User Agent [21:55:18] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.14.6-7-g55dde9d68b (SSHD-CORE-1.4.0) (protocol 2.0) [21:55:31] the program showing you the borked email [21:55:35] paladox: We were right, only had to offline reindex groups, other ones picked up online reindexer. [21:55:42] Platonides: Apple Mail.app. [21:55:43] no_justification :) [21:55:49] * paladox goes to polygerrit [21:56:00] is the mail fin on the server? [21:56:03] https://gerrit.wikimedia.org/r/?polygerrit=1 [21:56:05] how are you connecting to it? [21:56:08] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:56:14] IMAP? [21:57:07] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [21:57:08] legoktm does wikibugs reconnect back to gerrit? [21:57:16] Gerrit upgrade should be mostly done now. [21:57:25] DB migrated, puppet back on, plugins loaded. [21:57:27] :) [21:57:29] Watching logs now [21:57:34] Platonides: I believe so. Apple Auto-configured Gmail so whatever the default is. [21:57:35] that was quick! 
[21:57:38] no_justification mutante https://gerrit.wikimedia.org/r/c/385105/ [21:57:55] 10Operations, 10fundraising-tech-ops, 10monitoring: ssl monitoring: add civicrm.wikimedia.org to icinga - https://phabricator.wikimedia.org/T186328#3942209 (10RobH) 05Open>03declined [21:57:56] it's probably IMAP [21:58:37] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:58:48] Yep. I would understand an odd download but all of my digest mailing lists are missing since December. [21:58:57] at least we get a better error message with https://gerrit.wikimedia.org/r/c/99101 [21:58:58] I would review the original email on Gmail [21:59:00] in polygerrit [21:59:05] Both of https://gerrit.wikimedia.org/r/q/topic:%25222.14-post-upgrade%2522+(status:open%20OR%20status:merged) [21:59:10] and check wether they are right or not [21:59:20] no_justification: merge time? [21:59:29] I mean, if it was mailman fault, it would be evident there [21:59:36] if it is a correctly formed MIME message [21:59:40] (as I would expect) [21:59:43] it's Apple's fault [21:59:58] PROBLEM - HTTP on releases1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:11] sometimes it passes through a server which mangles it [22:00:19] well, releases1001 is unexpected [22:00:22] looking [22:00:28] but I don't think it will be a problem with gmail [22:00:44] mutante: Yeah [22:00:45] * paladox has a user status now https://gerrit.wikimedia.org/r/q/status:open [22:00:46] :) [22:00:48] RECOVERY - HTTP on releases1001 is OK: HTTP OK: HTTP/1.1 200 OK - 14914 bytes in 0.010 second response time [22:00:49] (albeit they have some funny adaptations of IMAP) [22:00:55] self-healing [22:01:43] merges https://gerrit.wikimedia.org/r/#/c/407753/ (in lieue of bot message) [22:01:50] did wikibugs recover? 
[22:02:17] merges https://gerrit.wikimedia.org/r/#/c/385105/ [22:02:23] legoktm doin't think so [22:02:57] It's possibly hit the new api change [22:03:09] ie change number is not a string now. so we have to do it in python [22:03:50] no_justification: both are on master now [22:03:57] pulling, thx [22:04:05] [tools-login.wmflabs.org] out: Your job 649648 ("wb2-grrrrit") has been submitted [22:04:10] (Merged as f803c91 ) heh :) [22:04:12] wait, one more :p [22:04:20] Gerrit: Remove velocity templates but keep the ones for its-base [22:04:20] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/385105 (owner: 10Paladox) [22:04:24] yay works [22:04:42] and it's fast loading [22:04:43] paladox: you had a wikibugs change we needed to merge right? [22:04:54] legoktm nope that's for 2.15 :) [22:05:12] (03CR) 10Dzahn: [C: 032] Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [22:05:22] ok [22:05:25] also how do I get the new UI? [22:05:34] now all 3 are merged. https://gerrit.wikimedia.org/r/#/q/topic:2.14-post-upgrade+(status:open+OR+status:merged) [22:05:42] :) [22:06:15] and i see wikibugs is back :) [22:06:46] :) [22:07:16] Note gerrit needs to be restarted for the changes to take effect [22:07:20] mutante no_justification ^^ [22:07:32] oh my ?polygerrit=1 [22:07:42] "gerrit: Ajust scap files (DO NOT MERGE)" ?:) [22:07:44] legoktm yep :) [22:07:52] I know [22:07:57] About restart [22:08:03] new ui?? 
[22:08:07] * apergos looks hopeful [22:08:07] Gerrit: Upgrading gerrit to 2.14.6-pre (DO NOT MERGE) [22:08:10] mutante that change dosen't need merging :) [22:08:11] ok thanks :) [22:08:12] apergos yep [22:08:13] it's also alot faster [22:08:15] wooooo [22:08:21] !log cobalt/gerrit2001: purged libbcprov-java libbcpkix-java, cleaned up old symlinks [22:08:32] apergos legoktm it will eventually look like https://docs.google.com/presentation/d/17q-ygGioZi_5DITLyELa8oaOr22e15AHy8cq6XTZ0nY/edit#slide=id.g27f16618ec_0_139 [22:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:51] hmm https://quarry.wmflabs.org/query/24534 [22:09:00] don't look right without use xyz_p; [22:09:08] that "eventually' sounds a bit far off, heh [22:09:16] Shouldddddd be the last restart now [22:09:25] apergos it's already happening :). in small pieces. [22:09:46] cool! [22:10:01] There's...some setting I want to change. It has to do with making startup times faster. [22:10:03] I should find it again [22:10:09] heh [22:10:29] paladox: so no more merges? [22:10:35] for today [22:10:43] mutante nope [22:10:45] httpd.reuseAddress [22:10:46] doin't think so [22:10:48] That was it [22:11:28] no_justification ah [22:11:31] we can set that [22:11:43] ok I'll need to restart wikibugs again [22:11:53] legoktm: That was the last time I promise [22:12:04] no worries :) [22:12:10] ack. "If true, permits the daemon to bind to the port even if the port is already in use. If false, the daemon ensures the port is not in use before starting. Busy sites may need to set this to true to permit fast restarts." [22:12:37] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [22:13:07] just don't break my gerrit account again :P :) [22:13:16] restarted [22:13:31] no_justification hmm its-phab dosen't seem to be working.
[22:13:49] Oh, did we not put the token tihngie in? [22:13:53] no_justification maybe [22:14:00] That's easy to fix [22:14:06] :) [22:14:06] Just needs to go in private repo then reload plug [22:14:09] *plugin [22:14:14] yep :) [22:14:19] eh, ok, what do you need in private repo [22:14:28] Ah yeah, lemme grab that token I guess :) [22:14:43] polygerrit's table view no longer has an option to show who voted in the CR/V columns [22:15:06] legoktm note polygerrit was experimental in 2.14 [22:15:14] in 2.15 it includes most of gwtui features [22:15:31] hola jynus [22:15:54] hm [22:16:01] I guess I shouldn't switch right away then? [22:16:12] You can, but ymmv :) [22:16:29] legoktm you can [22:16:52] Hey folks. I'm trying to check the puppet status of ores1001.eqiad.wmnet. Where's a good place to do that? [22:17:02] I suspect puppet failed. [22:17:08] But I can't log in to check. [22:17:28] halfak: icinga can tell you [22:17:32] i'll check though [22:17:33] halfak: Could've been transient from gerrit being down too. [22:17:38] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [22:18:00] Relevant line from SAL: [2018-01-31T15:44:19Z] reimage ores100{1..9} [22:18:02] ores1001 is a Unused spare system (spare::system) [22:18:13] you can also vote on changes even after they are merged [22:18:23] mutante, oh. What does that mean? [22:18:42] halfak: puppet is running and works but all it does is install standard tools and packages [22:18:46] nothing ores specific [22:18:59] Oh I see. [22:19:08] It seems akosiaris was doing something here: https://gerrit.wikimedia.org/r/#/c/407018/ [22:19:27] Maybe he forgot to put it back in the stresstest role. 
[22:19:37] yea, so he did that on purpose per that comment [22:19:47] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:48] he is in the middle of reimaging it looks [22:20:00] and set it to "spare" temporarily to avoid complications [22:20:09] Gotcha. Looks like I'm blocked then. [22:20:16] to avoid false alerts in monitoring etc [22:20:24] which would be added by the ores role [22:20:27] before ores is up [22:20:39] afaict [22:21:29] looks very "in progress" [22:22:07] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:23:37] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:24:17] RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:25:33] mutante: We need to add passwords::gerrit::gerrit_phab_token to private repo's hiera or w/e. [22:25:59] The value is on cobalt:/root/private_repo.txt [22:26:08] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:26:20] no_justification hmm, https://gerrit.wikimedia.org/r/c/384588/ returns 500 now [22:26:27] was working just a few mins ago [22:26:37] Didn't for me [22:26:45] Oh [22:27:15] https://phabricator.wikimedia.org/rOPUPe9226e2b5139c0cce4188b47cbc3eb4a52935eec [22:27:29] Can't open DB connrect :\ [22:27:37] oh [22:28:06] Getting 500 as well for everything [22:28:11] I get 500 [22:28:36] Yeah, where'd that come from? [22:29:09] not sure [22:29:14] Ah I see it [22:29:20] Online reindex - exhausting connections to DB [22:29:24] ah [22:29:53] "Note that it is not necessary to reindex the changes and accounts indexes offline. These will automatically be reindexed by the online reindexer after starting Gerrit." 
[22:29:57] 169920 tasks [22:30:04] that's alot heh [22:30:07] Yeah, we started so it started. [22:30:45] Falling quickly. Ok, it'll just all fail and go away [22:30:48] Should be ok [22:30:57] ok [22:31:00] So about 50% of my new Gerrit page loads have had internal server error banners or 500s [22:31:22] no_justification: passwords::gerrit::gerrit_phab_token added in private repo [22:31:28] Awesome thx [22:31:38] Known issue? [22:31:40] legoktm also about "polygerrit's table view no longer has an option to show who voted in the CR/V columns" what do you mean? [22:31:43] Matthew_ yep [22:31:50] * Matthew_ thumbs up [22:31:55] ~56k tasks left [22:31:56] Fun time. [22:31:59] lol [22:33:01] Ok, token applied, plugin should reload shortly [22:33:06] :) [22:33:06] dbproxy1002 is eventlogging db [22:33:37] RECOVERY - SSH cp1074.mgmt on cp1074.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0) [22:33:50] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [22:33:55] elukey: ottomata[m]^ [22:34:27] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [22:34:27] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_jenkins CI Composer] [22:34:50] ^ gerrit related [22:34:58] they just want to git clone, that's all [22:35:17] git push is failing, internal server error [22:35:19] Hmm Gerrit looks like it's back up but when I try to +2 a change it HTTP 500s [22:35:29] folks are working on it [22:35:34] actually not [22:35:39] RoanKattouw: uploading a patch isn't working either [22:35:39] reindexing is going on [22:35:41] it is misc services [22:35:49] jynus: is it gerrit? 
[22:35:55] it's reindex [22:35:59] it's trying to reindex online [22:36:02] it's exausting the db [22:36:05] and exhausting connections [22:36:06] could be [22:36:25] should we go back to offline reindex? [22:37:16] let's get a hold on to our hats then [22:37:23] when things go down we go to read only [22:37:58] jfc. [22:38:12] Or gerrit could be less batshit and not try to reindex everything [22:38:22] Offline reindex => offline for 3 hours. Online reindex => hammer the db to death [22:38:42] jynus: Can we get a *temporary* bump in connection limit just for a bit? [22:38:43] Trying to kill the storm [22:39:22] actually, I am going to limit the number of connections and point it back to the original server [22:39:34] dbs usually have a connection limit od 10000 [22:39:40] Ah ok, got it. [22:39:59] Yeah, this queuing ~60k jobs [22:40:00] otrs can be affected too [22:40:05] (it was like 150k) [22:40:07] and etherpad [22:40:27] I just had an issue with delayed CSS [22:40:47] no_justification: do you know the ip of the gerrit server? [22:40:51] online reindex but slow it down? [22:41:33] 208.80.154.85 and 208.80.154.81 [22:42:03] !log reloading m2 dbproxy [22:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:37] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:42:37] cobalt = host name gerrit = service name that's why 2 IPs [22:42:37] 38k [22:42:56] yeah, so gerrit had a 100 connection limit [22:42:57] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [22:43:08] but it was removed [22:43:22] I have to add it, and actually, that would help to not break itself [22:43:43] mutante: We use defaults in this area afaict. We could possibly throttle it a bit? 
[22:43:52] no_justification: if you have access, you may have to reload the apache/webservice [22:43:54] But we're almost through the backlog, so it'd be for next time [22:43:59] no_justification also its-phab dosen't work still https://phabricator.wikimedia.org/T176164 . [22:44:05] paladox: Low priority [22:44:08] ok [22:44:13] jynus: Apache or the service? [22:44:26] although it is working for me [22:44:40] Down to 4 tasks [22:44:43] if needed, the service- I only mentioned apache if apache was the service [22:44:56] no_justification: indexing may have failed [22:45:01] paladox: let's find config options to slow down online reindex [22:45:03] Oh, it most certainly did. [22:45:04] as we were in read only for some time [22:45:05] mutante, sorry to step away for a sec. Thanks for taking a look at it with me. I'll ping Alex to make sure the reimaging continues next week ^_^ [22:45:08] But I don't care :D [22:45:08] mutante ok [22:45:15] I'm mad it tried to index so poorly. [22:45:20] can someone check otrs [22:45:22] halfak: you're welcome. sounds good :) [22:45:25] ? [22:45:29] I do not have access [22:45:34] Nor do I [22:45:37] just the web interface [22:45:39] paladox: Caused by: com.googlesource.gerrit.plugins.its.phabricator.conduit.ConduitErrorException: Method 'maniphest.edit' gave: ERR-INVALID-SESSION, Session key is not present. [22:45:50] no_justification hmm /me checks [22:45:52] that is why I am asking for someone else :-) [22:46:02] I don't even know what url it's at anymore :p [22:46:48] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3942290 (10Halfak) Just checked on the hosts and it seems they are still in progress. @akosiaris, are you still working on the re-imaging? 
[22:47:01] I am not 100% sure it was gerrit [22:47:18] it seems limited to 100 connections, which shouldn't make the db fail [22:47:31] unless they where creating huge number of writes, I guess? [22:47:43] tendril will tell us [22:48:12] oh [22:48:13] It was reindexing all 407k changes. [22:48:21] ALTER TABLE change_messages ADD real_author [22:48:26] was there an upgrade ongoing? [22:48:41] OTRS looks up from the cmd line perspective [22:48:55] i dont know about the web ui either, heh [22:48:56] jynus: Yes, the upgrade finished quite awhile ago, *then* the indexer ran [22:49:01] but something is runnign there [22:49:14] And the indexer running is when we started getting alerts from dbproxy [22:49:19] (the upgrade was long over) [22:49:20] it's ticket.wikimedia.org [22:49:27] mutante: yeah, it doesn't go down, but if it uses persistent connections, somtimes it takes a bit to redirect those [22:49:38] paladox: what the Asignee field is for? [22:49:39] I do not see it badly from mysql [22:49:42] so it should be ok [22:49:48] Hauskatze to assign the change to a reviewer [22:49:50] ok [22:49:54] ie to get someone to review the change [22:49:57] no_justification: I think your alters may have blocked normal traffic [22:50:12] not "your", I hope you understand what you mean [22:50:16] paladox: didn't 'reviewers' worked for that? [22:50:19] upgrade's [22:50:28] * no_justification nods [22:50:41] ping me for upgrades if you want- we dbas can do those fully online and monitor them [22:50:44] Hauskatze reviewers added reviewers. the assignee field assigns a change to a reviewer for it to be reviewed. [22:50:53] (03CR) 10Jforrester: [C: 031] "Nice work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407152 (owner: 10Krinkle) [22:50:53] upgrades that contain schema changes, I mean [22:50:57] no_justification this works at least @ gerrit master so i must have done something wrong. 
[22:51:04] (@ its-phab) [22:51:32] (03CR) 10Krinkle: "No-op, but will wait for Monday just in case." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407152 (owner: 10Krinkle) [22:51:35] that likely created some pileups, that lead to the proxy think that the db was dead [22:52:21] but it was just for a small amout of time (a few seconds), as the other proxy didn't complain [22:52:47] Mhmmm [22:53:37] hm, zuul/jenkins doesn't want to test my changes any more? [22:53:58] most likly zuul is hitting the 500 [22:54:00] this is known, i guess? [22:54:22] well, ci was working before, but we seem to be hitting db connection issues due to auto reindex. [22:54:23] "Ideally, disk limit of this cache is large enough to cover all changes. This should significantly speed up change reindexing, especially full offline reindexing." [22:54:47] anyway https://gerrit.wikimedia.org/r/402437 needs to be poked once gerrit/zuul is happy again [22:55:07] or you could play with C+2'ing that and figuring out why zuul ignores it [22:55:19] everthing should be ok now, but some writes may have been lost between between 22:15 and 22:45 [22:56:34] https://integration.wikimedia.org/zuul/ is stick not picking up the 402437 job, even after un- and re-C+2ing it [22:56:36] jynus: nodepool/CI/jenkins is not working [22:56:50] and it looks like the CologneBlue test has been stuck in gate-and-submit for 40 min [22:56:50] It's working just fine. [22:56:56] It's communication with gerrit. [22:57:12] yeah, that could be delayed [22:57:23] sorry, I'm not privy to the inners, glad to know it's just a communication issue [22:57:24] as in, failing jobs executed 15 minutes ago [22:57:25] "index.batchThreads [22:57:25] Number of threads to use for indexing in background operations, such as online schema upgrades. [22:57:30] no_justification possibly due to it not connecting to phab side's it throws that session thingy. 
[22:57:45] * paladox try's something locally to test out thiery [22:57:45] yep, you should tune down that to 1 [22:57:57] I thought the default *was* one [22:58:02] in any case, if something is still weird, restart the service [22:58:04] paladox: Most likely, yes. [22:58:08] If not set or set to a negative value, defaults to the number of logical CPUs as returned by the JVM. [22:58:18] no_justification i have an idea so i am going to test that locally :) [22:58:31] My wiki is broken :-( [22:58:34] Each of those will run very quickly by themselves, so that'd be enough threads to overwhelm pretty quick [22:58:51] how many logical CPUs returned by the JVM [22:59:02] Looks like either gerrit isn't sending events through its stream to zuul, or zuul has lost the connection / unable to reconnect. [22:59:27] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:59:35] Krinkle most likly it's hitting 500 [22:59:37] from gerrit [22:59:46] unlikely, it doesn't use http. [22:59:59] Krinkle yes, but ssh. [23:00:04] mutante: can you restart otrs service? [23:00:09] Though yes zuul may need to be restarted [23:00:20] I still see some connection to the passive host by otrs [23:00:53] Maybe, but before we add more load on Gerrit, let's first make sure we know why Zuul isn't getting events. [23:01:01] actually, I can kill them, nevermind [23:01:03] It might be that it's already reconnecting, but Gerrit isn't sending anything yet. 
[23:01:23] eh, ok, am not sure how yet, but i can try [23:01:30] making a patch for gerrit [23:01:32] mutante: don't worry [23:01:35] solved [23:01:40] :) Ok [23:01:54] so, no database issues here [23:02:12] cool, thanks jynus [23:02:14] now, if gerrit is still broken, blame gerrit (or the upgrade) :-D [23:02:48] zuul/debug.log on contint1001 suggests the last event Zuul got from Gerrit was about Jenkins finishing the jobs for https://gerrit.wikimedia.org/r/#/c/407837/ and posting V+2, which it did. That was 22:47 [23:02:50] (UTC) [23:02:58] or we can turn off the entire online upgrading next time [23:03:00] "Whether to upgrade to new index schema versions while the server is running. This is recommended as it prevents additional downtime during Gerrit version upgrades" [23:03:07] And again as always [23:03:13] Log has been idle for past 5 minutes. [23:03:14] Any is not ok with zuul [23:03:21] (Zuul debug log, just waiting for input) [23:03:25] no_justification works for me [23:03:30] https://phab.wmflabs.org/T1 [23:03:59] Zoranzoki21: there isa Gerrit upgrade happening [23:04:06] https://phabricator.wikimediaorg/P6657 full stacktrace [23:04:10] git review is stuck for me, maybe queuing issues? 
[23:04:24] phab is sloww [23:04:25] it took some time [23:04:27] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:04:29] ah [23:04:32] wrong url [23:04:36] receovery shows that releaess server could clone [23:04:57] uploaded https://gerrit.wikimedia.org/r/407857 [23:05:05] check for web service overload or git pileups [23:05:07] using git-review [23:05:28] if you can replicate it, if not, don't worrry [23:05:56] (03CR) 10Paladox: [C: 031] gerrit: limit index batch threads to 1 [puppet] - 10https://gerrit.wikimedia.org/r/407857 (owner: 10Dzahn) [23:06:09] (03CR) 10Zoranzoki21: [C: 031] gerrit: limit index batch threads to 1 [puppet] - 10https://gerrit.wikimedia.org/r/407857 (owner: 10Dzahn) [23:06:11] bacula is working on it to save new files or something [23:06:32] but it's not extreme, java not using 100% cpu either [23:06:49] What's even wrong? [23:06:57] so you do not need me anymore, right? [23:06:58] wow the new web UI surprised :) [23:07:05] What's the complaint about gerrit? Gerrit is fine from its perspective. [23:07:09] no_justification: Mind if I restart zuul? Its various logs on contint1001 suggest it stopped receiving any events since about 5-10min ago. No errors about ssh or anything not working, or that it failed to reconnect. So seems safe to restart and see if it reconnects? [23:07:15] (other than the ITS bug, which I think is the issue with Zuul/CI) [23:07:16] no_justification: zuul i think [23:07:24] Krinkle: that sounds good! [23:07:25] Krinkle: Just restart the whole thing [23:07:26] zuul [23:07:32] no_justification damn, it includes the token in the log [23:07:33] yes please [23:07:34] you can polish rough edges if there is any, right? [23:07:35] when it fails auth [23:07:54] jynus: i think so yea :) [23:08:08] Okay, seems to be working now. [23:08:10] woork [23:08:12] yeaaa [23:08:42] paladox: Um, what? [23:08:55] I think the session is ok now. 
[23:09:03] no_justification i will share in a pm what it looks like (luckly i did the change from cert to token) [23:09:05] but, why gate-and-submit no work? [23:09:07] I wonder if it was trying to use some old session? [23:09:25] I'm not seeing the session bug anymore. [23:09:48] it could have [23:10:13] no_justification ah [23:10:20] maybe it's the session code i removed here [23:10:21] https://github.com/GerritCodeReview/plugins_its-phabricator/commit/88ccc3654d1ce87bd67414458f018a6d4f298c31#diff-51ddebacd5daa3cb93b4edbcf1429f5e [23:10:23] to reset it [23:11:24] I do wanna see this gate-and-submit job go through [23:11:27] Make sure it's still not failing on write [23:12:14] work [23:12:25] It's still in progress. [23:12:32] (also gerrit has a new host signiture by default) [23:12:39] uses edcsa [23:12:40] zuul look as to there is ddos [23:12:46] Aha [23:12:53] no_justification dosen't zuul ddos gerrit [23:12:57] when it carn't connect [23:13:02] Zoranzoki21: No there is no ddos on zuul [23:13:06] I know [23:13:17] There is no ddos. [23:13:20] Let's stop saying that :) [23:13:20] But see how much patches [23:13:22] ok [23:13:29] ok but see how much patches [23:13:36] * Zoranzoki21 laughing [23:13:37] sorry [23:13:46] That's called a backlog [23:13:48] Not a ddos. [23:13:51] * no_justification shakes head [23:14:51] paladox: Nope, session bug still bug. [23:14:56] no_justification ok [23:14:59] :\ [23:15:01] i will try fixing this thing [23:15:02] Hmmm [23:15:05] no_justification: ok [23:15:06] Ok off the wall question, is git review -d broken or am I daft? [23:15:15] Matthew_: yes, I think [23:15:26] git review is broken by its very existence [23:15:27] You think I'm daft? Kidding kidding, I think it is too. 
[23:15:31] Traveler:mediawiki-clean Matthew$ git review -d 400618 [23:15:31] Cannot query patchset information [23:15:31] The following command failed with exit code 104 [23:15:31] "GET https://gerrit.wikimedia.org/changes/?q=400618&o=CURRENT_REVISION" [23:15:31] ----------------------- [23:15:33] 404 Not Found [23:15:34] Not Found [23:15:34] The requested URL /changes/ was not found on this server.
[23:15:35] [23:15:36] aha [23:15:41] which version are you using? [23:15:43] Ugh, use a pastebin [23:15:47] this was fixed recently [23:15:48] Sorry wrong click. [23:15:57] I clicked "Messages" rather than "pastebin" in my haste. [23:16:07] Link is problem [23:16:07] also everyone needs to be on this version [23:16:15] (anyone who uses git-review) [23:16:29] 1.26.0 [23:16:36] 1.25.0 or lower is broken with gerrit 2.14 [23:16:38] I don't even know how I installed git review... let's see. [23:17:00] I use git review so it creates a change-id for me [23:17:20] That's the commit-msg hook [23:17:24] You don't need git-review for that [23:17:37] Matthew_: What does git remote -v show? [23:17:43] whatever, I don't use git push refs/for/branch [23:17:53] origin https://gerrit.wikimedia.org/r/p/mediawiki/core.git (fetch) [23:17:55] origin https://gerrit.wikimedia.org/r/p/mediawiki/core.git (push) [23:18:19] Right [23:18:19] that is unrelated to the upgrade and git-review works for me. it's not broken :) [23:18:27] in the version i get [23:18:54] That might break if you push but I think git review -d should work over HTTPS? [23:19:08] you can use ssh or https [23:19:12] Right [23:19:25] Roan, no... https://www.irccloud.com/pastebin/5W4Hxvm3/ [23:19:28] I was going to suggest changing that remote to SSH, but you'd think that HTTPS should work [23:19:38] if you use https to push you need to set the "http password" [23:19:41] Also, there is no git-review 1.26 on brew. [23:20:05] Matthew_: Try git remote set-url origin ssh://yourGerritShellUserName@gerrit.wikimedia.org:29418/mediawiki/core.git [23:20:13] i have 1.25.0-2 [23:20:24] only 1.25.0 has the issue [23:20:54] Worked. [23:21:01] So I need to use SSH to clone from now on? [23:21:14] Can any review all patches of Umherirrender? 
[23:21:19] you can do either, but i would use ssh because then you dont have to use that extra password [23:21:26] because jenkins did not it [23:22:14] Zoranzoki21: adding a comment "recheck" on it should make it do that [23:22:19] mutante: I know [23:22:20] now.. it hink [23:22:41] Gosh I despise Gerrit... [23:22:51] what's wrong? [23:23:14] git revert doesn't work for some reason, I'm having to go in via a GUI and clean up the mess... on a file that should not be changed at all! [23:23:30] :O https://gerrit.wikimedia.org/r/#/q/topic:requires+status:open [23:23:38] what mess though? [23:23:59] you are saying something broke that was ok before? [23:24:14] mutante: you said it to me? [23:24:26] to Matthew, but it kind of fits both :)( [23:24:28] Yes, git-review somehow set my repository into an un-git-able state. [23:24:36] I'm force reverting everything. [23:24:49] git-review is a third party tool [23:24:58] not sure i follow, i used it without issues [23:25:04] * Matthew_ shrugs [23:25:16] and that wrong URL wouldnt mean you have to revert anything [23:25:36] DUnno what happened. But a git pull followed by a git review -d should not leave me with an error message about shelving commits on a clean repository. [23:27:37] Depends on your merge strategy and what was merged. Dropping submodules leaves dirty repos for example [23:27:38] * no_justification shrugs [23:27:54] nope, last time this was used was submitting my last patch. I didn't update or anything. [23:28:29] Just changed the URL (cuz I got the 404 before) and then did the git review -d. Errors about merged files, so git pull. I've got 2300 uncommited changed files and merge errors on two. I did nothing. [23:29:15] Never mind, I'll just blow it away and try again. [23:29:24] if there was no edited content [23:29:27] git reset --hard [23:29:29] and try again [23:29:50] That's the kicker, I never edited anything. But yes, I'm just cloning fresh. 
[23:29:54] that can happen when merges fail midway
[23:30:14] does git review execute a merge somehow? That may have broken it.
[23:31:17] rather than merges, I think it's the fetch part of the pull
[23:31:35] Hm. Maybe.
[23:31:38] I just have seen this before
[23:31:39] It would make sense.
[23:31:57] without gerrit or git-review
[23:32:39] I know nothing was changed because I have a habit of having a "-clean" copy of the repository (my work uses SVN and that's how I commit... different discussion) and that clean repository is never edited. That's what I'm trying to use. But eh, a clean clone should fix it.
[23:33:48] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 25 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[23:37:16] I miss svn
[23:37:42] i don't miss merging branches in SVN
[23:37:50] though it sure beat CVS
[23:37:57] there's a branch other than trunk?
[23:38:18] i can only... speculate
[23:38:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 24 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[23:39:06] We branch per release, so somebody out there is merging from branch to trunk to other branch. It's... hectic.
[23:39:21] brion: In retrospect, we shoulda just kept a giant git repo for all MW + extensions :)
[23:39:31] Except repo sizes are atrocious :)
[23:39:44] supercalifragilisticverybiggitrepo
[23:40:07] lol
[23:40:30] !log gerrit: one last restart to try and force gerrit/phab session restart
[23:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:00] Error 503
[23:41:11] Zoranzoki21 we are restarting gerrit
[23:41:15] .. again :)
[23:41:19] to try and fix T186370
[23:41:20] T186370: its-phabricator seems to be broken on gerrit 2.14 - https://phabricator.wikimedia.org/T186370
[23:41:23] Krinkle T186370
[23:41:24] Zoranzoki21: ahh again
[23:41:26] Does that mean zuul's queue will get stuck again and restart too?
[23:41:37] I guess we'll find out :)
[23:41:38] Zuul reconnects if memory is right :)
[23:41:44] Hopefully it's quick enough that it doesn't notice :)
[23:41:49] Krinkle: some stroopwafels in the meantime
[23:42:52] yay
[23:42:53] no_justification ^^
[23:42:55] it works
[23:42:55] https://phabricator.wikimedia.org/T176164
[23:43:17] Hmmm
[23:43:19] Seems so?
[23:43:36] Yep
[23:43:37] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki]
[23:43:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 12 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[23:43:53] Stupid gerrit
[23:43:57] Yay all fixed.
[23:43:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 11 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[23:44:08] :) cool!
[23:45:16] This day will be unforgettable for ci, zuul and gerrit servers
[23:45:34] I'll forget it by tomorrow.
[23:45:47] no_justification i'm now pretty sure that the change i removed, just means we have to restart gerrit and restarting the plugin won't affect it changing.
[23:45:52] (sessionKey)
[23:46:35] Zoranzoki21: This was a boring upgrade mostly. Zuul/Gerrit have always had trouble talking after upgrades. We usually have to whack it with a hammer a few times.
[23:46:47] You weren't here the year we discovered we had to offline reindex hundreds of thousands of changes.
[23:46:49] Hammers - the ideal solution.
[23:46:51] Gerrit was down for 4 hours
[23:47:40] And not just flappy up/down with zuul being weird
[23:47:47] But fully down and couldn't hit the service at all
[23:47:51] OMG
[23:47:59] How many patches in list in my changes
[23:48:19] lol
[23:48:23] it's the same as gwtui
[23:48:35] just you now have the newest first
[23:48:36] at the top
[23:48:39] this is a scheduled maintenance that is well inside the announced window
[23:49:07] :)
[23:49:18] mutante: I think 2.8 => 2.11 was probably the roughest one.
[23:49:29] heh, yea
[23:49:34] So was 2.7 -> 2.8, iirc
[23:49:37] heh
[23:49:38] Platonides: I fixed the change by the way, 400618? The one you commented on, I got it rebased with master.
[23:49:42] We've been using it since what, 2.4?
[23:49:43] Ish?
[23:49:44] :)
[23:49:49] lol
[23:49:54] (That took 40 minutes because Gerrit and git-review)
[23:49:54] I see all this in my list (including next pages): https://gerrit.wikimedia.org/r/#/q/owner:umherirrender_de.wp%2540web.de+status:open
[23:50:10] yeh?
[23:50:13] :lol:
[23:51:58] no_justification don't you mean 2.8 -> 2.12 :)
[23:52:18] Erm yeah
[23:52:20] That
[23:52:52] All-Users i expect got big :)
[23:53:27] no_justification did you see my user status in gerrit?
[23:53:36] Nope :)
[23:53:46] no_justification https://gerrit.wikimedia.org/r/q/owner:%2522Paladox+%253Cthomasmulhall410%2540yahoo.com%253E%2522
[23:53:58] (you can only see it in polygerrit)
[23:54:05] you can also name your patches
[23:54:09] one i named rebase
[23:54:43] Hah, really?
[23:54:55] yeh
[23:54:58] no_justification try it :)
[23:55:01] it's really fun
[23:55:15] (also a polygerrit feature)
[23:55:52] Hah
[23:55:57] That's funny
[23:56:00] yep :)
[23:56:16] no_justification also you can vote on merged changes too though i have no idea why someone would
[23:56:39] "-2 this was a terrible idea wtf were you thinking?"
[23:56:44] 135 changes only and 408k coming
[23:57:33] lol
[23:58:42] paladox: https://gerrit.wikimedia.org/r/q/owner:dzahn ?
[23:59:00] no_justification https://gerrit.wikimedia.org/r/c/407865/
[23:59:07] ha
[23:59:10] you found it :)
[23:59:34] i'll change it back though, don't want people to think i laugh at their code :)
[23:59:43] lol
[23:59:55] back to 500s from Gerrit.