[00:01:34] (03PS2) 10Dzahn: DNS: Remove mgmt DNS entries for ms-be20[0-1[1-9] [dns] - 10https://gerrit.wikimedia.org/r/361682 (owner: 10Papaul) [00:14:33] (03CR) 10Dzahn: [C: 032] DNS: Remove mgmt DNS entries for ms-be20[0-1[1-9] [dns] - 10https://gerrit.wikimedia.org/r/361682 (owner: 10Papaul) [00:20:02] 10Operations, 10Performance-Team, 10Patch-For-Review: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3384858 (10Dzahn) CRITICAL: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts is alerting: Difference in JS size mobile authenticated [ALERT] al... [00:20:28] 10Operations, 10Performance-Team, 10Patch-For-Review: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3384859 (10Dzahn) 05Resolved>03Open [00:23:32] ACKNOWLEDGEMENT - Apache HTTP on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 80: Connection refused daniel_zahn https://phabricator.wikimedia.org/T168881 [00:23:32] ACKNOWLEDGEMENT - Check HHVM threads for leakage on mw2148 is CRITICAL: NRPE: Command check_check_leaked_hhvm_threads not defined daniel_zahn https://phabricator.wikimedia.org/T168881 [00:23:32] ACKNOWLEDGEMENT - Check systemd state on mw2148 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T168881 [00:23:32] ACKNOWLEDGEMENT - HHVM processes on mw2148 is CRITICAL: NRPE: Command check_hhvm not defined daniel_zahn https://phabricator.wikimedia.org/T168881 [00:23:32] ACKNOWLEDGEMENT - HHVM rendering on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 80: Connection refused daniel_zahn https://phabricator.wikimedia.org/T168881 [00:23:32] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw2148 is CRITICAL: connect to address 10.192.32.36 and port 443: Connection refused daniel_zahn https://phabricator.wikimedia.org/T168881 [00:23:34] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2148 is CRITICAL: Host mw2148 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T168881 [00:24:08] 10Operations, 10ops-codfw, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Rename mw2148 / mw2149 / mw2259 / mw2260 to thumbor200[1234] - https://phabricator.wikimedia.org/T168881#3379407 (10Dzahn) ACKed these https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=mw2148 [00:27:33] 10Operations, 10Mail, 10fundraising-tech-ops: (re)move problemsdonating aliases - https://phabricator.wikimedia.org/T127488#3384869 (10Dzahn) Ok, cool. Maybe add ZenDesk ticket number. Please remove from privateexim module in private repo once OIT has them confirmed on their side. 
[00:29:56] (03CR) 10Dzahn: [C: 031] package_builder: Make init.pp compatible with stretch [puppet] - 10https://gerrit.wikimedia.org/r/361698 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [00:31:46] (03CR) 10Dzahn: [C: 031] "seems right to me, but hashar please look too" [puppet] - 10https://gerrit.wikimedia.org/r/361680 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [00:32:25] (03CR) 10Dzahn: "i think you said yourself in this one the subscriber list is missing completely" [puppet] - 10https://gerrit.wikimedia.org/r/361190 (owner: 10Paladox) [00:33:34] (03CR) 10Dzahn: [C: 031] "per "trusty had the same thing, but even 90 seconds, not just 5"" [puppet] - 10https://gerrit.wikimedia.org/r/360876 (owner: 10RobH) [00:35:47] (03CR) 10Dzahn: "actually, i DO see the feed list at the bottom, but i think we have to fine-tune this and set a different limit how many feeds it includes" [puppet] - 10https://gerrit.wikimedia.org/r/361190 (owner: 10Paladox) [00:36:57] (03CR) 10Dzahn: [C: 031] Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [00:39:51] (03PS1) 10Chad: releaes1001: Also duplicate most of the microsite setup on new host [puppet] - 10https://gerrit.wikimedia.org/r/361803 [00:43:34] (03PS1) 10Chad: releases1001: Introduce reprepo profile based on microsite [puppet] - 10https://gerrit.wikimedia.org/r/361804 [00:44:07] (03CR) 10Chad: "This plus its parent should basically have releases1001 acting like the previous microsite. Once we land, can test, then copy data and swa" [puppet] - 10https://gerrit.wikimedia.org/r/361804 (owner: 10Chad) [00:51:27] (03CR) 10Chad: "Plus updating anything that points to the old host, but perhaps that can be hiera-ized?" 
[puppet] - 10https://gerrit.wikimedia.org/r/361804 (owner: 10Chad) [01:02:48] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:48] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [01:02:58] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [01:02:58] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:02:59] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6480 [01:03:48] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8646662 keys, up 3 minutes 39 seconds - replication_delay is 0 [01:03:48] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 8647791 keys, up 3 minutes 39 seconds - replication_delay is 0 [01:03:48] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8651942 keys, up 3 minutes 44 seconds - replication_delay is 0 [01:03:48] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 8555595 keys, up 3 minutes 44 seconds - replication_delay is 0 [01:03:58] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3945568 keys, up 3 minutes 54 seconds - replication_delay is 0 [01:19:33] (03PS1) 10Chad: Swap reprepro upload role to profile [puppet] - 10https://gerrit.wikimedia.org/r/361805 [01:34:58] PROBLEM - Upload HTTP 
5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [01:36:58] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [01:36:58] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:36:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [01:39:58] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [01:40:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [01:42:58] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:43:36] !log demon@tin Synchronized README: profiling (duration: 00m 47s) [01:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:58] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:44:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:44:58] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:51:59] (03PS2) 10Dzahn: releases: Duplicate most of the microsite setup on new host releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/361803 (owner: 10Chad) [02:04:30] (03CR) 10Dzahn: [C: 032] releases: Duplicate most of the microsite setup on new host releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/361803 (owner: 10Chad) [02:11:43] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3384928 (10Dzahn) "releases: Duplicate most of 
the microsite setup on new host releases1001" (Chad) - https://ger... [02:13:39] (03PS2) 10Dzahn: releases1001: Introduce reprepo profile based on microsite [puppet] - 10https://gerrit.wikimedia.org/r/361804 (owner: 10Chad) [02:16:04] (03CR) 10Dzahn: [C: 032] releases1001: Introduce reprepo profile based on microsite [puppet] - 10https://gerrit.wikimedia.org/r/361804 (owner: 10Chad) [02:24:15] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.6) (duration: 07m 55s) [02:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:33] (03PS2) 10Dzahn: Swap reprepro upload role to profile [puppet] - 10https://gerrit.wikimedia.org/r/361805 (owner: 10Chad) [02:42:48] (03CR) 10Dzahn: [C: 032] Swap reprepro upload role to profile [puppet] - 10https://gerrit.wikimedia.org/r/361805 (owner: 10Chad) [02:43:32] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3384973 (10Dzahn) "Swap reprepro upload role to profile" (Chad) - https://gerrit.wikimedia.org/r/#/c/361805/ [02:58:57] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 14m 50s) [02:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jun 28 03:05:57 UTC 2017 (duration 7m 0s) [03:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:31:58] (03PS1) 10Chad: Create rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/361811 [03:32:00] (03PS1) 10Chad: Use rsync::quickdatacopy for copying data between old-new release servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 [03:33:17] (03CR) 10jerkins-bot: [V: 04-1] Use rsync::quickdatacopy for copying data between old-new release servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [03:33:28] (03PS2) 10Chad: 
Create rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/361811 [03:34:41] (03PS2) 10Chad: Use rsync::quickdatacopy for copying data between old-new release servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 [03:36:06] (03CR) 10Chad: "I can't tell you how many times I've written these same basic pieces of puppet code." [puppet] - 10https://gerrit.wikimedia.org/r/361811 (owner: 10Chad) [03:36:15] (03CR) 10jerkins-bot: [V: 04-1] Use rsync::quickdatacopy for copying data between old-new release servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 (owner: 10Chad) [04:39:44] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [04:40:44] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [05:23:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Add comments to db1033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361815 [05:23:32] (03PS2) 10Marostegui: Revert "db-eqiad.php: Add comments to db1033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361815 [05:24:48] !log Stop MySQL and reboot db1034 for maintenance - T166208 [05:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:59] T166208: Convert unique keys into primary keys for some wiki tables on s7 - https://phabricator.wikimedia.org/T166208 [05:25:08] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Add comments to db1033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361815 (owner: 10Marostegui) [05:26:10] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Add comments to db1033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361815 (owner: 10Marostegui) [05:27:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove comments from db1033 status - T166208 (duration: 00m 47s) [05:27:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:14] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:43:14] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:43:14] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:43:24] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:44:04] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:44:04] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:44:04] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [05:44:14] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:44:22] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3385053 (10Liuxinyu970226) [05:46:32] ^ those are the backups probably [05:46:36] I will sience it [05:55:08] !log Temporarily disable event scheduler on dbstore2001 - https://phabricator.wikimedia.org/T168354 [05:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:33] (03PS2) 10Giuseppe Lavagetto: Add pypi classifiers [software/conftool] - 10https://gerrit.wikimedia.org/r/343622 [06:03:14] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2298.00 Read Requests/Sec=409.20 Write Requests/Sec=2.90 KBytes Read/Sec=44226.40 KBytes_Written/Sec=99.20 [06:03:17] (03CR) 10Giuseppe Lavagetto: [C: 032] Add pypi classifiers [software/conftool] - 10https://gerrit.wikimedia.org/r/343622 (owner: 10Giuseppe Lavagetto) [06:15:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=9.50 Read 
Requests/Sec=22.70 Write Requests/Sec=58.50 KBytes Read/Sec=384.40 KBytes_Written/Sec=849.60 [06:32:34] (03CR) 10Alexandros Kosiaris: "thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/361798 (owner: 10Dzahn) [06:34:56] !log Stop Replication in sync on db2033 and dbstore2001 (x1) - T168354 [06:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:05] T168354: dbstore2001: s5 thread isn't able to catch up with the master - https://phabricator.wikimedia.org/T168354 [06:35:11] !log executed sudo -u _graphite find /var/lib/carbon/whisper/eventstreams/rdkafka -type f -mtime +10 -delete on graphite1001 to free space [06:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:44] (03PS1) 10Elukey: graphite: lower down eventstreams whisper files retention [puppet] - 10https://gerrit.wikimedia.org/r/361818 (https://phabricator.wikimedia.org/T160644) [06:37:37] !log restart pdfrender.service on scb1003 - xpra race condition [06:37:43] (03PS2) 10Alexandros Kosiaris: Expose mediawiki.revision-create stream from eventstreams. [puppet] - 10https://gerrit.wikimedia.org/r/361497 (https://phabricator.wikimedia.org/T167670) (owner: 10Ppchelko) [06:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:51] (03CR) 10Alexandros Kosiaris: [C: 032] Expose mediawiki.revision-create stream from eventstreams. 
[puppet] - 10https://gerrit.wikimedia.org/r/361497 (https://phabricator.wikimedia.org/T167670) (owner: 10Ppchelko) [06:37:54] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [06:39:34] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational [06:39:36] (03CR) 10Alexandros Kosiaris: [C: 031] Change mailman DEFAULT_DMARC_MODERATION_ACTION to 1 (munge from) [puppet] - 10https://gerrit.wikimedia.org/r/361685 (https://phabricator.wikimedia.org/T168467) (owner: 10Herron) [06:40:14] RECOVERY - Check systemd state on mw2148 is OK: OK - running: The system is fully operational [06:42:59] !log stop jobrunner/jobchron on mw130[2,3] and reboot them for kernel updates [06:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:34] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:46:14] PROBLEM - Check systemd state on mw2148 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:46:32] godog: puppet.service doesn't start on thumbor2001 [06:48:54] PROBLEM - MariaDB Slave Lag: x1 on db2033 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.00 seconds [06:49:06] elukey: puppet as a service? should be blocked by config, you sure it doesn't just need a reset failed? [06:49:51] volans: I have no idea, I didn't check but I thought it was something running the agent [06:50:02] i downtimed db2033... [06:50:09] I think.. [06:50:14] lol [06:50:19] volans: yeah it does ExecStart=/usr/bin/puppet agent $DAEMON_OPT [06:50:44] i really think i did! [06:52:05] marostegui: ok if I restart my alter table journey on db1047? :) [06:52:12] sure! 
:) [06:52:17] make sure you downtime it :) [06:52:23] elukey: that's usually normal on reimage, it tries it, but we ship "daemonize = false" in the config and we don't run it as a service, so stopped/failed is correct, just reset-fail-it and you're good to go ;) [06:52:40] volans: good to know, thanks :) [06:52:41] I'm wondering why it alarmed now though [06:53:07] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Make init.pp compatible with stretch [puppet] - 10https://gerrit.wikimedia.org/r/361698 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [06:53:12] (03PS3) 10Alexandros Kosiaris: package_builder: Make init.pp compatible with stretch [puppet] - 10https://gerrit.wikimedia.org/r/361698 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [06:53:16] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: Make init.pp compatible with stretch [puppet] - 10https://gerrit.wikimedia.org/r/361698 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [06:54:34] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational [06:54:45] there you go --^ [06:54:48] elukey: something was trying to start it: [06:54:49] Jun 28 06:39:20 thumbor2001 puppet-agent[21215]: Starting Puppet client version 3.8.5 [06:55:15] RECOVERY - Check systemd state on mw2148 is OK: OK - running: The system is fully operational [06:55:23] Jun 28 06:39:18 thumbor2001 systemd[1]: Starting Puppet agent... [06:55:42] yes I tried to restart it but it failed, my bad [06:56:11] so that log is a PEBKAC from my side [06:56:20] I'm not sure I follow, you didn't react the alarm? 
[06:56:27] s/react/react to/ [06:56:55] or you were there before :D [06:57:33] there was an error in icinga, IIRC with 5d of elapsed time, I checked on the host and saw the failed unit for puppet.service [06:57:48] anyway seems clearly an EMISSINGCOFFEE to me ;) [06:58:04] ahhh got i [06:58:06] it [06:58:34] nono for some reason I didn't think about the nonsense of the unit and tried to restart it [06:58:42] so worse than EMISSINGCOFFEE [06:58:44] :D [06:58:54] buuuut new thing learned [06:59:33] in the meantime, mw130[2,3] back in service [07:01:07] !log stop jobrunner/jobchron on mw130[4,5,6] and reboot them for kernel updates [07:01:11] (last ones) [07:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:54] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave] [07:07:14] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave] [07:08:34] did we just lose eqord-ulsfo? checking calendar [07:10:21] * volans not finding anything Telia-related on maintenance calendar [07:11:16] XioNoX: ^^^ (I'm not sure which TZ you are in right now, sorry if it's late/early ;) ) [07:11:23] I'm here [07:11:41] There were some emails about a maintenance being completed already from Telia [07:12:18] marostegui: there was a maintenance in the calendar? I cannot see it [07:12:32] different circuit ID, but I wouldn't be surprised if it's related [07:14:06] the maintenance was for eqord<->eqiad, and the circuit down is eqord<->ulsfo [07:14:26] oh [07:14:27] Status update: The planned work is about to start. 
Planned Work PWIC77080 from Telia Carrier to Wikimedia Foundation, Inc., 2017-Jun-28 07:00 - 2017-Jun-28 13:00 UTC [07:14:42] Service ID: IC-313592 [07:14:42] Impact: 1 x 360 minutes interruption [07:14:44] boom [07:14:57] there you go! :) [07:15:28] 6h... [07:16:05] downtimed in icinga [07:16:54] and good morning :) [07:17:42] and don't hesitate to ping me, whatever timezone I'm on [07:19:07] Awesome, that email just arrived and the maintenance was started before the email [07:19:10] nice [07:29:54] RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 0.12 seconds [07:32:22] (03PS4) 10Giuseppe Lavagetto: cassandra::instance: allow defining multiple data directories [puppet] - 10https://gerrit.wikimedia.org/r/361673 [07:33:35] !log Re-enable event scheduler on dbstore2001 - T168354 [07:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:44] T168354: dbstore2001: s5 thread isn't able to catch up with the master - https://phabricator.wikimedia.org/T168354 [07:46:18] (03PS14) 10Jcrespo: mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) [07:47:17] (03CR) 10Filippo Giunchedi: [C: 031] graphite: lower down eventstreams whisper files retention [puppet] - 10https://gerrit.wikimedia.org/r/361818 (https://phabricator.wikimedia.org/T160644) (owner: 10Elukey) [07:47:41] (03PS2) 10Elukey: graphite: lower down eventstreams whisper files retention [puppet] - 10https://gerrit.wikimedia.org/r/361818 (https://phabricator.wikimedia.org/T160644) [07:49:34] !log disable puppet on all database hosts for deployment of gerrit:361456 [07:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:20] (03CR) 10Elukey: [C: 032] graphite: lower down eventstreams whisper files retention [puppet] - 10https://gerrit.wikimedia.org/r/361818 (https://phabricator.wikimedia.org/T160644) (owner: 10Elukey) 
[07:50:37] elukey: ^ nice, thanks! [07:51:08] (03CR) 10Filippo Giunchedi: "ping?" [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [07:52:11] (03CR) 10Jcrespo: [C: 032] mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [07:52:22] (03PS15) 10Jcrespo: mariadb: refactor option support and move it to hiera [puppet] - 10https://gerrit.wikimedia.org/r/361456 (https://phabricator.wikimedia.org/T148507) [07:52:41] (03PS3) 10Alexandros Kosiaris: Renumber 4 VMs to public1-c-eqiad [dns] - 10https://gerrit.wikimedia.org/r/361662 [07:52:43] (03PS3) 10Alexandros Kosiaris: Renumber install1002.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/361663 [07:52:45] (03PS3) 10Alexandros Kosiaris: Bump the TTLs again after renumbering [dns] - 10https://gerrit.wikimedia.org/r/361664 [07:52:47] (03PS1) 10Alexandros Kosiaris: Lower TTL for lists.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/361822 [07:52:56] (03CR) 10jerkins-bot: [V: 04-1] Renumber 4 VMs to public1-c-eqiad [dns] - 10https://gerrit.wikimedia.org/r/361662 (owner: 10Alexandros Kosiaris) [07:52:58] (03CR) 10jerkins-bot: [V: 04-1] Renumber install1002.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/361663 (owner: 10Alexandros Kosiaris) [07:54:26] (03PS2) 10Alexandros Kosiaris: Lower TTL for lists.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/361822 [07:54:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Lower TTL for lists.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/361822 (owner: 10Alexandros Kosiaris) [07:58:14] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:00:44] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:01:21] jynus: those might be yours ^^^ :) [08:03:44] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:03:54] yeah [08:09:14] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:09:54] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:10:14] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:10:34] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:13:18] 10Operations, 10ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035#3385254 (10fgiunchedi) [08:14:54] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:14:54] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:15:04] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:17:24] (03PS1) 10Jcrespo: mariadb: Followup to gerrit:361456 [puppet] - 10https://gerrit.wikimedia.org/r/361824 (https://phabricator.wikimedia.org/T168356) [08:19:17] (03CR) 10Jcrespo: [C: 032] mariadb: Followup to gerrit:361456 [puppet] - 10https://gerrit.wikimedia.org/r/361824 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [08:20:05] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:20:14] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:20:44] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [08:21:04] 10Operations, 10Wikibase-DataModel, 10Wikidata, 10Patch-For-Review, 10Wikidata-Sprint: Remove left-over alias for wikidata.org/ontology (doesn't work) - https://phabricator.wikimedia.org/T169023#3385320 (10Lydia_Pintscher) [08:22:14] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:23:12] volans: if you have a moment could you take a look at https://gerrit.wikimedia.org/r/#/c/361648/ and https://gerrit.wikimedia.org/r/#/c/361647/ ? [08:23:32] godog: sure [08:23:44] thanks! [08:26:26] godog: although I have almost zero context :-P [08:27:14] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [08:27:41] volans: heh the tl;dr is that disks might get renumbered on boot on swift machines, for sure on hp at least [08:27:48] namely T163673 [08:27:48] T163673: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673 [08:27:56] yeah, I got that part :D [08:28:08] was on the puppettization side of it :-P [08:28:52] godog: so you're removing md from the regex, I guess cleanup from older configs? 
[08:29:45] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:29:46] volans: yeah I think so, we haven't been init_device with md since I've been here [08:37:14] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:38:14] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [08:38:34] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:38:54] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [08:43:54] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [08:44:14] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [08:44:55] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:44:59] (03CR) 10Filippo Giunchedi: "Indeed looks like this doesn't work properly in beta yet as per T166013" [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [08:46:50] !log legoktm@tin Synchronized php-1.30.0-wmf.6/includes/parser/ParserCache.php: Add debug logging for T168040 (duration: 00m 48s) [08:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:02] T168040: Table of contents (TOC) missing sporadically without apparent reason - https://phabricator.wikimedia.org/T168040 [08:49:04] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:49:35] 10Operations, 10Mail, 10OTRS: OTRS spam classification methods and systems - 
https://phabricator.wikimedia.org/T146968#2676399 (10jberkel) It looks like the anti-spam mechanism is not working as expected. Almost all (90%) of the OTRS emails I currently get are spam emails, to the point that I don't feel like... [08:53:15] PROBLEM - dhclient process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:15] PROBLEM - nutcracker process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:54:04] RECOVERY - nutcracker process on thumbor1003 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [08:54:05] RECOVERY - dhclient process on thumbor1003 is OK: PROCS OK: 0 processes with command name dhclient [08:55:21] <_joe_> uhm thumbor is out of prod right now ? [08:55:43] no we've put it back last night, I'll take a look at what's up [09:03:13] (03CR) 10Volans: [C: 04-1] "LGMT, just one probably typo." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [09:05:24] (03PS1) 10Jcrespo: Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) [09:05:53] !log legoktm@tin Synchronized php-1.30.0-wmf.7/includes/parser/ParserCache.php: Add debug logging for T168040 (duration: 00m 46s) [09:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:04] T168040: Table of contents (TOC) missing sporadically without apparent reason - https://phabricator.wikimedia.org/T168040 [09:07:18] (03CR) 10Volans: "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:07:35] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/6866/ confirms this is a noop." 
[puppet] - 10https://gerrit.wikimedia.org/r/361673 (owner: 10Giuseppe Lavagetto) [09:08:14] (03PS2) 10Jcrespo: Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) [09:09:04] (03PS5) 10Giuseppe Lavagetto: cassandra::instance: allow defining multiple data directories [puppet] - 10https://gerrit.wikimedia.org/r/361673 [09:09:38] (03CR) 10jerkins-bot: [V: 04-1] Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:11:06] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg [09:11:51] (03PS3) 10Jcrespo: Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) [09:12:02] (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra::instance: allow defining multiple data directories [puppet] - 10https://gerrit.wikimedia.org/r/361673 (owner: 10Giuseppe Lavagetto) [09:12:54] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [09:13:44] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [09:14:52] XioNoX: \o/ :) [09:15:04] good [09:15:07] (03CR) 10jerkins-bot: [V: 04-1] Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:17:09] (03PS4) 10Jcrespo: Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) [09:18:46] (03CR) 10Jcrespo: [C: 032] Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 
(https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:18:52] (03PS5) 10Jcrespo: Add basedir option to all templates; restore option on init.d [puppet] - 10https://gerrit.wikimedia.org/r/361831 (https://phabricator.wikimedia.org/T168356) [09:23:14] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:23:43] (03CR) 10Volans: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [09:24:54] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:26:04] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:28] (03PS1) 10Jcrespo: mariadb: Change init.d template variable to avoid defaults changes [puppet] - 10https://gerrit.wikimedia.org/r/361836 (https://phabricator.wikimedia.org/T168356) [09:28:14] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:14] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:53] (03PS2) 10Jcrespo: mariadb: Change init.d template variable to avoid defaults changes [puppet] - 10https://gerrit.wikimedia.org/r/361836 (https://phabricator.wikimedia.org/T168356) [09:30:47] (03CR) 10Jcrespo: [C: 032] mariadb: Change init.d template variable to avoid defaults changes [puppet] - 10https://gerrit.wikimedia.org/r/361836 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:30:54] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:31:54] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:37:50] (03PS1) 10Jcrespo: mariadb: Service followup to gerrit:361836 [puppet] - 10https://gerrit.wikimedia.org/r/361838 (https://phabricator.wikimedia.org/T168356) [09:42:33] (03CR) 10Jcrespo: [C: 032] mariadb: Service followup to gerrit:361836 [puppet] - 10https://gerrit.wikimedia.org/r/361838 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [09:49:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "instrumentation is used in internal scripts for deployment." [puppet] - 10https://gerrit.wikimedia.org/r/348074 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [09:51:44] (03CR) 10Giuseppe Lavagetto: [C: 031] Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [09:52:14] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:52:34] (03CR) 10Alexandros Kosiaris: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/361662 (owner: 10Alexandros Kosiaris) [09:52:39] (03PS1) 10Jcrespo: labsdb-replica: Change default basedir to the 10.1 package [puppet] - 10https://gerrit.wikimedia.org/r/361841 (https://phabricator.wikimedia.org/T168356) [09:52:42] (03CR) 10Alexandros Kosiaris: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/361663 (owner: 10Alexandros Kosiaris) [09:56:04] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [09:56:14] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:56:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't really like pytest at all, but if you want to switch to it (a first for us) you should at least explain why in the commit message." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/361274 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [09:56:57] (03PS2) 10Jcrespo: labsdb-replica: Change default basedir to the 10.1 package [puppet] - 10https://gerrit.wikimedia.org/r/361841 (https://phabricator.wikimedia.org/T168356) [09:57:14] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:57:39] (03CR) 10Paladox: "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/361698 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [09:59:17] (03PS3) 10Jcrespo: labsdb-replica: Change default basedir to the 10.1 package [puppet] - 10https://gerrit.wikimedia.org/r/361841 (https://phabricator.wikimedia.org/T168356) [09:59:54] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:02:55] (03PS4) 10Jcrespo: labsdb-replica: Change default basedir to the 10.1 package [puppet] - 10https://gerrit.wikimedia.org/r/361841 (https://phabricator.wikimedia.org/T168356) [10:05:33] (03CR) 10jenkins-bot: Only enable logging on enwiki for MobileFormatter#moveFirstParagraphBeforeInfobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361789 (https://phabricator.wikimedia.org/T169001) (owner: 10Jdlrobson) [10:05:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Add comments to db1033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361815 (owner: 10Marostegui) [10:05:37] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361681 (https://phabricator.wikimedia.org/T142582) (owner: 10Jdrewniak) [10:05:39] (03CR) 10jenkins-bot: Cleanup: Remove wgMFContentNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361788 (owner: 10Jdlrobson) [10:06:51] (03CR) 10Giuseppe Lavagetto: [C: 031] "but please fix the typo :)" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 
(https://phabricator.wikimedia.org/T158747) (owner: 10Volans) [10:09:22] (03PS5) 10Jcrespo: labsdb-replica: Change default basedir to the 10.1 package [puppet] - 10https://gerrit.wikimedia.org/r/361841 (https://phabricator.wikimedia.org/T168356) [10:10:18] (03PS1) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [10:16:43] (03CR) 10Jcrespo: [C: 032] labsdb-replica: Change default basedir to the 10.1 package [puppet] - 10https://gerrit.wikimedia.org/r/361841 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [10:23:27] (03PS5) 10Volans: CLI: improve configuration error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) [10:23:47] (03CR) 10Volans: "addressed comments" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) (owner: 10Volans) [10:24:05] (03CR) 10jerkins-bot: [V: 04-1] CLI: improve configuration error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) (owner: 10Volans) [10:24:18] (03PS1) 10Ema: 4.1.6-1wm2: new varnish-counters for transient storage [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/361845 (https://phabricator.wikimedia.org/T164768) [10:24:56] (03PS6) 10Volans: CLI: improve configuration error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) [10:25:48] (03CR) 10Giuseppe Lavagetto: [C: 031] ClusterShell: allow to set a timeout per command (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [10:25:50] (03CR) 10Volans: [C: 032] CLI: improve configuration error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) (owner: 10Volans) [10:26:38] 
(03Merged) 10jenkins-bot: CLI: improve configuration error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/357234 (https://phabricator.wikimedia.org/T158747) (owner: 10Volans) [10:26:42] (03PS1) 10Jcrespo: mariadb: Avoid the usage of undef or '' due to its falsey nature [puppet] - 10https://gerrit.wikimedia.org/r/361846 (https://phabricator.wikimedia.org/T168356) [10:28:05] (03CR) 10Giuseppe Lavagetto: [C: 031] "I think this is the behaviour everyone was expecting." [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [10:29:02] (03PS2) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [10:35:20] (03PS2) 10Jcrespo: mariadb: patch mariadb.service to include the basedir [puppet] - 10https://gerrit.wikimedia.org/r/361846 (https://phabricator.wikimedia.org/T168356) [10:37:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I am on the verge of -2ing this patch, but the gist of it is:" [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [10:37:26] (03PS3) 10Jcrespo: mariadb: patch mariadb.service to include the basedir [puppet] - 10https://gerrit.wikimedia.org/r/361846 (https://phabricator.wikimedia.org/T168356) [10:40:13] (03CR) 10Jcrespo: [C: 032] mariadb: patch mariadb.service to include the basedir [puppet] - 10https://gerrit.wikimedia.org/r/361846 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [10:52:36] !log restarting db2072's mysql for testing of new config [10:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:51] PSA: for some reason mysql doesn't work, even if it is an alias: Failed to start mysql.service: Unit mysql.service not found. 
[10:56:05] systemctl start mariadb is needed right now on stretch [11:01:04] nagios works [11:01:11] (03PS4) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [11:01:17] (03CR) 10jerkins-bot: [V: 04-1] ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [11:01:30] (03CR) 10Volans: "Addressed comment" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) (owner: 10Volans) [11:01:52] * volans should stop using the edit button on gerrit :) [11:02:14] grafana works [11:02:36] (the parts that work on stretch, of course) [11:04:42] !log restarting db2062's mysql [11:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:00] (03PS5) 10Volans: ClusterShell: allow to set a timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359466 (https://phabricator.wikimedia.org/T164838) [11:08:18] I cannot discard some log noise about 10.192.16.195 [11:08:23] ignore it [11:12:13] the peak of errors should no longer exist, it happened from 11:06 to 11:12 [11:12:36] none user-facing [11:14:35] !log stop eventlogging_sync on db1047 - alter tables running [11:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:28] (03PS4) 10Volans: CLI: migrate to timeout per command [software/cumin] - 10https://gerrit.wikimedia.org/r/359467 (https://phabricator.wikimedia.org/T164838) [11:44:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add build script plus nodejs base images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/360813 (owner: 10Giuseppe Lavagetto) [11:46:46] (03PS4) 10Alexandros Kosiaris: Renumber 4 VMs to public1-c-eqiad [dns] - 10https://gerrit.wikimedia.org/r/361662 [11:46:49] 
(03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Renumber 4 VMs to public1-c-eqiad [dns] - 10https://gerrit.wikimedia.org/r/361662 (owner: 10Alexandros Kosiaris) [11:48:50] !log renumber dubnium fermium meitnerium ununpentium [11:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:44] PROBLEM - Host meitnerium is DOWN: PING CRITICAL - Packet loss = 100% [11:54:15] (03PS1) 10Daniel Kinzler: Temporarily set $wgPropertySuggesterClassifyingPropertyIds to [ 31 ]. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361851 (https://phabricator.wikimedia.org/T169058) [11:55:10] RECOVERY - Host meitnerium is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [11:58:19] (03PS2) 10Daniel Kinzler: Temporarily set $wgPropertySuggesterClassifyingPropertyIds to [ 31 ]. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361851 (https://phabricator.wikimedia.org/T169058) [12:03:07] (03PS1) 10Alexandros Kosiaris: Update mailman::lists IPv4/IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/361852 [12:03:27] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update mailman::lists IPv4/IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/361852 (owner: 10Alexandros Kosiaris) [12:13:14] (03CR) 10Ayounsi: [C: 031] Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) (owner: 10Ema) [12:18:12] !log Deploy alter table on s7 directly on codfw master (db2029) and let it replicate - T168661 [12:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:22] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:19:10] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3386169 (10faidon) Let's not forget to actually revoke those certificates too. 
We're getting a little off-topic here though, so perhaps @Jgreen / @CCogdill_WMF / @DKaufma... [12:21:54] (03CR) 10Elukey: "@Volans, ready for your review :)" [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [12:26:00] RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89979.55 seconds [12:27:12] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3386198 (10Jgreen) 05Open>03Resolved a:03Jgreen >>! In T137161#3386169, @faidon wrote: > Let's not forget to actually revoke those certificates too. We're getting a... [12:27:15] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3386203 (10Jgreen) [12:29:40] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3386206 (10Jgreen) [12:29:43] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3386204 (10Jgreen) 05Resolved>03Open I didn't intentionally close this task... 
[12:37:52] !log restricted inbound BGP to configured neighbors on pfw - T169048 [12:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:08] !log starting enabling puppet on db2* hosts [12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:40] 10Operations, 10Electron-PDFs, 10Reading-Web-Backlog, 10Services, 10Patch-For-Review: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3386277 (10ovasileva) [12:46:30] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:26] !log Deploy alter table on s3 directly on codfw master (db2018) and let it replicate - T168661 [12:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:35] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:49:53] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3268863 (10faidon) Why do these need to be jessie? Has anyone checked whether that rootdelay=5 workaround is still needed in stretch? 
[12:51:14] (03CR) 10Filippo Giunchedi: swift: use implicit /dev/swift prefix for swift devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [12:53:05] (03PS4) 10Filippo Giunchedi: swift: ship udev rules for swift disks [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) [12:53:07] (03PS4) 10Filippo Giunchedi: swift: use implicit /dev/swift prefix for swift devices [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) [12:53:14] (03CR) 10Filippo Giunchedi: swift: ship udev rules for swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [12:54:42] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3268863 (10akosiaris) >>! In T165520#3386297, @faidon wrote: > Why do these need to be jessie? Has anyone checked whether that rootdelay=5 workaround is still needed in stretch? parsoid. It's nodejs `6.9.1` (fr... [12:54:50] (03CR) 10Volans: [C: 031] "LGTM, replied inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [12:55:40] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:57:36] (03PS5) 10Filippo Giunchedi: swift: ship udev rules for swift disks [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) [12:57:38] (03PS5) 10Filippo Giunchedi: swift: use implicit /dev/swift prefix for swift devices [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) [12:58:30] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:58:40] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170628T1300). [13:05:36] (03PS2) 10Gehel: maps - align configuration for all maps clusters [puppet] - 10https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T160215) [13:05:51] o/ [13:05:58] and there is nothing to deploy :-) [13:06:20] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:40] looking at the errors [13:10:40] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:10:46] on puppet [13:12:23] (03CR) 10Volans: swift: ship udev rules for swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [13:14:00] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:21:49] (03PS1) 10Jcrespo: mysql-hiera: Add hosts for codfw:s2, which were missing [puppet] - 10https://gerrit.wikimedia.org/r/361857 (https://phabricator.wikimedia.org/T148507) [13:22:37] (03CR) 10Marostegui: [C: 031] mysql-hiera: Add hosts for codfw:s2, which were missing [puppet] - 10https://gerrit.wikimedia.org/r/361857 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:23:44] (03PS2) 10Jcrespo: mysql-hiera: Add hosts for codfw:s2, which were missing [puppet] - 10https://gerrit.wikimedia.org/r/361857 (https://phabricator.wikimedia.org/T148507) [13:26:02] (03CR) 10Jcrespo: [C: 032] mysql-hiera: Add hosts for codfw:s2, which were missing [puppet] - 10https://gerrit.wikimedia.org/r/361857 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:28:50] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:29:30] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:30:11] !log renumber install1002 [13:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:52] (03PS4) 10Alexandros Kosiaris: Renumber install1002.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/361663 [13:31:02] (03CR) 10Alexandros Kosiaris: [C: 032] Renumber install1002.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/361663 (owner: 10Alexandros Kosiaris) [13:33:41] PROBLEM - puppet last run on install1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:35:10] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:12] (03PS3) 10Strainu: Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) [13:41:45] (03PS3) 10Gehel: maps - align configuration for all maps clusters [puppet] - 10https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T169082) [13:44:29] !log start reimage of the maps-test cluster - T169011 [13:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:39] T169011: reimage maps-test servers - https://phabricator.wikimedia.org/T169011 [13:45:01] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:45:19] (03PS4) 10Gehel: maps - align configuration for all maps clusters [puppet] - 10https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T169082) [13:47:32] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:51:22] !log tightening BGP configuration on cr* routers - T169048 [13:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:43] (03CR) 10Gehel: [C: 032] maps - align configuration for all maps clusters [puppet] - 10https://gerrit.wikimedia.org/r/351872 (https://phabricator.wikimedia.org/T169082) (owner: 10Gehel) [13:52:01] (03PS1) 10Rush: labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) [13:52:47] chasemp: don't use ipaddress_eth0 anymore [13:52:56] paravoid: what is the preferred? 
[13:53:06] just ipaddress [13:53:16] ok can do [13:53:26] it DTRT these days, we override it with our own version [13:54:04] and _eth0 is not future-proof, as ethN interface names are essentially deprecated :) [13:54:33] * chasemp nods -- understood yeah thank you [13:55:24] (03PS2) 10Rush: labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) [13:55:32] yw :) [13:57:50] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:58:30] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:00:50] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:00:56] (03PS3) 10Rush: labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) [14:00:58] (03CR) 10jerkins-bot: [V: 04-1] labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) (owner: 10Rush) [14:02:30] (03PS3) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [14:03:54] !log pypi.python.org has an issue with its CDN . That would affect any CI jobs relying on tox/python - See https://status.python.org for updates [14:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:17] !log pypi.python.org has an issue with its CDN . 
That would affect any CI jobs relying on tox/python - See https://status.python.org for updates and T169091 [14:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:29] T169091: pypi.python.org having transient issues (503 errors) - https://phabricator.wikimedia.org/T169091 [14:06:29] hi, im wondering what do i do if i get this error from apt-get update please? [14:06:30] Err:5 http://cdn-fastly.deb.debian.org/debian stretch Release [14:06:30] 503 Maximum threads for service reached [IP: 151.101.88.204 80] [14:08:26] E: The repository 'http://httpredir.debian.org/debian stretch Release' does no longer have a Release file. [14:08:48] paladox: pypi.python.org has a similar issue and they are on Fastly as well [14:08:55] oh [14:08:57] thanks [14:12:42] (03PS4) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [14:14:48] hashar is fastly problem also affecting wikimedias apt repo? [14:14:55] I get a new error now. [14:14:58] Reading package lists... Done [14:14:58] W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/stretch-wikimedia/InRelease Cannot initiate the connection to apt.wikimedia.org:80 (2620:0:861:1:208:80:154:22). - connect (101: Network is unreachable) [IP: 2620:0:861:1:208:80:154:22 80] [14:14:58] W: Some index files failed to download. They have been ignored, or old ones used instead. [14:15:18] na Fastly is a public CDN [14:15:30] well by Public I mean, not a WMF one :) [14:15:33] (03PS1) 10BBlack: HSTS: higher and custom max-age [puppet] - 10https://gerrit.wikimedia.org/r/361864 [14:15:35] ok [14:15:41] apt.wikimedia.org is maintained by Wikimedia for sure [14:15:48] yeah I am renumbering this one [14:15:55] issue is known [14:16:00] ok thanks. 
[14:17:35] (03CR) 10Giuseppe Lavagetto: Add build script plus nodejs base images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/360813 (owner: 10Giuseppe Lavagetto) [14:18:04] (03PS8) 10Giuseppe Lavagetto: Add build script plus nodejs base images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/360813 [14:19:29] (03CR) 10Alexandros Kosiaris: [C: 031] Add build script plus nodejs base images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/360813 (owner: 10Giuseppe Lavagetto) [14:22:22] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add build script plus nodejs base images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/360813 (owner: 10Giuseppe Lavagetto) [14:26:43] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3386850 (10faidon) We talked about this a little bit on IRC. I think we agreed to try stretch with node.js 6, since we're going to have to do that at some point anyway and there doesn't seem to be any reason to... [14:28:49] (03CR) 10Filippo Giunchedi: Add firewall rules for pinkunicorn (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) (owner: 10Ema) [14:30:11] (03PS1) 10Gilles: Include fonts role in Thumbor role [puppet] - 10https://gerrit.wikimedia.org/r/361871 (https://phabricator.wikimedia.org/T169072) [14:32:30] (03CR) 10Filippo Giunchedi: [C: 032] Include fonts role in Thumbor role [puppet] - 10https://gerrit.wikimedia.org/r/361871 (https://phabricator.wikimedia.org/T169072) (owner: 10Gilles) [14:32:32] (03CR) 10Subramanya Sastry: [C: 031] "+1, but it also needs a change in the installation portions to remove the manually git install part of it." [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [14:33:07] (03CR) 10Subramanya Sastry: [C: 031] "ah, needs a manual rebase. let me do it." 
[puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [14:34:21] hashar: what do you think of https://gerrit.wikimedia.org/r/#/c/358910/ btw ? [14:36:28] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3386931 (10Papaul) [14:36:47] (03PS5) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [14:37:02] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3174562 (10Papaul) @RobH This is complete on my end. Thanks [14:38:24] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, 10User-fgiunchedi: Decommission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3386952 (10Papaul) a:05Papaul>03RobH [14:39:20] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:39:39] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3386956 (10Cmjohnson) [14:40:43] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3269143 (10Cmjohnson) The second ethernet ports are cabled, cleared of the current vlan as far as I can tell. They need to be added to lab-instances. @faid... 
[14:42:58] !log pypi.python.org is back again - T169091 [14:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:08] T169091: Fastly CDN has some transient issue (was: pypi.python.org having transient issues (503 errors) ) - https://phabricator.wikimedia.org/T169091 [14:43:20] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 0 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[ttf-ubuntu-font-family],Package[texlive-fonts-recommended] [14:43:35] (03PS2) 10Subramanya Sastry: Use packaged uprightdiff in testreduce and visualdiff [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [14:43:41] godog: I have forgot about it sorry :( [14:44:16] godog: if I parse your last comment properly, it only remove metrics that haven't been touched in the last 30 days isn't it ? [14:44:29] 10Operations, 10ops-codfw, 10hardware-requests: reclaim/decom tmh200[12] - https://phabricator.wikimedia.org/T168472#3386967 (10Papaul) Disk wipe in progress. [14:44:31] (03CR) 10Subramanya Sastry: [C: 031] "Okay, manually rebased + conflicts resolved." [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [14:45:42] hashar: that's correct yeah, looks at mtime [14:46:30] (03CR) 10Hashar: [C: 04-1] role: cleanup zuul data in graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [14:47:10] PROBLEM - DPKG on thumbor2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:47:12] godog: excellent. Lets add /whisper/gerrit which is generated by Zuul . And we probably should add /whisper/nodepool as well (generated by Nodepool) [14:47:20] PROBLEM - DPKG on mw2259 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:49:41] hashar: ack thanks for looking! 
I'll amend it later [14:51:10] RECOVERY - DPKG on thumbor2004 is OK: All packages OK [14:51:10] PROBLEM - DPKG on thumbor1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:51:30] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [14:52:20] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [14:52:21] RECOVERY - DPKG on mw2259 is OK: All packages OK [14:52:21] that's me ^ installing fonts packages + apt.w.o glitch = fun [14:54:00] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[ttf-ubuntu-font-family],Package[texlive-fonts-recommended] [14:54:10] PROBLEM - DPKG on mw2149 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:54:27] (03PS1) 10Alexandros Kosiaris: diamond: Re-enable some collectors [puppet] - 10https://gerrit.wikimedia.org/r/361878 [14:55:10] RECOVERY - DPKG on mw2149 is OK: All packages OK [14:55:53] godog: so yes definitely +1 [14:56:00] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:56:20] PROBLEM - DPKG on thumbor2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:57:20] RECOVERY - DPKG on thumbor2002 is OK: All packages OK [14:58:29] (03CR) 10Alexandros Kosiaris: [C: 032] diamond: Re-enable some collectors [puppet] - 10https://gerrit.wikimedia.org/r/361878 (owner: 10Alexandros Kosiaris) [14:59:03] (03PS2) 10Hashar: role: cleanup CI data in graphite [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [14:59:10] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[fonts-taml-tscu],Package[xfonts-scalable] [14:59:10] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[fonts-taml-tscu],Package[xfonts-scalable] [15:00:11] PROBLEM - puppet last run on mw2149 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[fonts-thai-tlwg],Package[fonts-sil-scheherazade] [15:00:11] PROBLEM - puppet last run on thumbor2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[fonts-thai-tlwg],Package[fonts-sil-scheherazade] [15:00:28] 10Operations, 10Labs, 10Labs-Infrastructure, 10procurement: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3387058 (10Papaul) [15:01:28] (03CR) 10Hashar: "I took the liberty to amend the change and clean out /gerrit and /nodepool" [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [15:02:27] (03PS2) 10BBlack: HSTS: higher and custom max-age [puppet] - 10https://gerrit.wikimedia.org/r/361864 [15:02:29] (03PS1) 10BBlack: ssl_ciphersuite: limit ECDH curves where possible [puppet] - 10https://gerrit.wikimedia.org/r/361879 [15:03:10] RECOVERY - puppet last run on mw2149 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:03:11] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:03:11] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:03:11] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:04:19] (03PS6) 10Ema: Add firewall rules for pinkunicorn 
[puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [15:07:37] (03PS1) 10Alexandros Kosiaris: Typo fix for LoadAverage diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/361880 [15:08:23] 10Operations, 10Labs, 10Labs-Infrastructure, 10procurement: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3387089 (10Papaul) Rack location Row C rack C1 [15:08:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Typo fix for LoadAverage diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/361880 (owner: 10Alexandros Kosiaris) [15:11:01] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [15:11:11] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [15:11:54] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3387106 (10Papaul) Rack location Row B rack B8 [15:14:24] (03PS1) 10Giuseppe Lavagetto: profile::docker::builder: check out production-images [puppet] - 10https://gerrit.wikimedia.org/r/361881 (https://phabricator.wikimedia.org/T162042) [15:16:51] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3387165 (10Cmjohnson) [15:17:00] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3349071 (10Cmjohnson) network ports are setup [15:20:12] cmjohnson1: o/ - any news about an1030? 
Otherwise I'll open a tracking task [15:21:57] (03PS2) 10Giuseppe Lavagetto: profile::docker::builder: check out production-images [puppet] - 10https://gerrit.wikimedia.org/r/361881 (https://phabricator.wikimedia.org/T162042) [15:23:03] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::docker::builder: check out production-images [puppet] - 10https://gerrit.wikimedia.org/r/361881 (https://phabricator.wikimedia.org/T162042) (owner: 10Giuseppe Lavagetto) [15:24:04] (03CR) 10Brian Wolff: [C: 031] Set collation for Romanian wikis to uca-ro [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361066 (https://phabricator.wikimedia.org/T168711) (owner: 10Strainu) [15:24:39] @elukey an1030 kept pxe booting...looking at it now [15:25:29] (03CR) 10Filippo Giunchedi: swift: ship udev rules for swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [15:25:31] thanks! [15:25:37] not super urgent though! [15:25:50] if you have more pressing thing feel free to drop it [15:26:18] (03CR) 10Jcrespo: "Sorry, too late to +1" [puppet] - 10https://gerrit.wikimedia.org/r/361878 (owner: 10Alexandros Kosiaris) [15:26:36] 10Operations, 10Labs, 10Labs-Infrastructure, 10procurement: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3387217 (10chasemp) a:05chasemp>03RobH >>! In T168894#3379828, @RobH wrote: > @Chasemp: Please review my racking and vlan/IP proposal above and co... [15:28:15] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3387233 (10chasemp) a:05chasemp>03Papaul >>! In T168893#3379863, @RobH wrote: > @Chasemp: Please review my racking and vlan/IP proposal above and c... 
[15:28:51] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3387235 (10chasemp) a:05chasemp>03Papaul >>! In T168892#3379861, @RobH wrote: > @Chasemp: Please review my racking and vlan/IP proposal above and c... [15:29:35] !log slowly enabling puppet on pending database hosts, checking diff on each one [15:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:30] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3387241 (10chasemp) >>! In T168891#3379859, @RobH wrote: > @Chasemp: Please review my racking and vlan/IP proposal above and confirm or correct. Once that... [15:30:41] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3387242 (10chasemp) a:05chasemp>03Papaul [15:31:03] (03PS2) 10BBlack: VCL: remove disableImages handling [puppet] - 10https://gerrit.wikimedia.org/r/359417 (https://phabricator.wikimedia.org/T168013) (owner: 10Ema) [15:31:47] 10Operations, 10Discovery, 10Maps, 10Traffic, 10Interactive-Sprint: Rate-limit browsers without referers - https://phabricator.wikimedia.org/T154704#3387243 (10Gehel) a:03ema Significant work has already be done on T163233. @ema is aware of this task and will come back to us with some idea / plan / or... 
[15:31:52] 10Operations, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3387248 (10chasemp) [15:32:04] (03Abandoned) 10BBlack: un-anchor regexes for select [software/conftool] - 10https://gerrit.wikimedia.org/r/294371 (owner: 10BBlack) [15:32:13] 10Operations, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestneutron2002 - https://phabricator.wikimedia.org/T167160#3319508 (10chasemp) 05Open>03Resolved I am going to close this, consider it done :) I am handling configuration here in other tasks as it's ongoing and initial thank you @robh... [15:32:51] 10Operations, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3387252 (10chasemp) 05Open>03Resolved I am going to close this, consider it done :) I am handling configuration here in other tasks as it's ongoing and initial thank you @RobH and @... [15:33:54] (03CR) 10Filippo Giunchedi: swift: ship udev rules for swift disks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [15:33:55] (03PS4) 10Rush: labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) [15:37:13] (03CR) 10Andrew Bogott: [C: 031] labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) (owner: 10Rush) [15:39:16] (03PS6) 10Filippo Giunchedi: swift: ship udev rules for swift disks [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) [15:39:35] elukey: an1030 is back [15:39:53] thanks! 
[15:40:52] urandom: might be 5/10 minutes late to the standup sorry, conflicting meetings :( [15:41:07] (03CR) 10Filippo Giunchedi: icinga/role:mail::mx: add monitoring of exim queue size (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [15:41:16] (03CR) 10Filippo Giunchedi: [C: 032] swift: ship udev rules for swift disks [puppet] - 10https://gerrit.wikimedia.org/r/361647 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [15:44:20] !log kartik@tin Started deploy [cxserver/deploy@894e3fe]: (no justification provided) [15:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:29] 10Operations, 10DBA, 10Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356#3387288 (10jcrespo) With the new puppet refactoring, hosts just work- that doesn't mean the puppet structure is ideal- we need to change many things such as multi-instance support and fix the... [15:45:54] (03PS7) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [15:47:06] !log kartik@tin Finished deploy [cxserver/deploy@894e3fe]: (no justification provided) (duration: 02m 47s) [15:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:19] (03CR) 10Rush: [C: 032] labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) (owner: 10Rush) [15:47:25] (03PS5) 10Rush: labnet: SSH listen only on administrative IP [puppet] - 10https://gerrit.wikimedia.org/r/361860 (https://phabricator.wikimedia.org/T169068) [15:47:57] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:49:46] (03PS8) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [15:50:00] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3387313 (10Papaul) Rack location Row C rack C1 [15:50:56] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3387315 (10Papaul) Rack location Row D rack D1 [15:52:00] (03PS1) 10Jcrespo: mariadb: Add missing host db1084 [puppet] - 10https://gerrit.wikimedia.org/r/361891 (https://phabricator.wikimedia.org/T148507) [15:52:16] !log Temporary ignore jawiki.watchlist table during replication on dbstore1001 - T169050 [15:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:26] T169050: dbstore1001 mysql crashed with: semaphore wait has lasted > 600 seconds - https://phabricator.wikimedia.org/T169050 [15:53:15] (03PS2) 10Jcrespo: mariadb: Add missing host db1084 [puppet] - 10https://gerrit.wikimedia.org/r/361891 (https://phabricator.wikimedia.org/T148507) [15:54:16] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3387322 (10Papaul) [15:54:39] (03CR) 10Jcrespo: [C: 032] mariadb: Add missing host db1084 [puppet] - 10https://gerrit.wikimedia.org/r/361891 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:54:54] godog: disk for ms-be1016 is here..can you confirm which one is bad please [15:55:04] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3387333 (10Papaul) [15:55:53] kart_: strange that no alarms have fired either [15:56:04] kart_: i'll
take a look at the deployment logs [15:56:21] mobrovac: okay. [15:56:35] mobrovac: I've not seen any errors there either. [15:56:44] that should fix db1084 [15:56:57] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:56:59] (03PS5) 10Andrew Bogott: Add labtestpuppetmaster2001 hiera host defs [puppet] - 10https://gerrit.wikimedia.org/r/361704 [15:57:26] (03CR) 10Mark Bergsma: [C: 04-1] "No actual objection from me yet, but let's make sure this is widely discussed before merging, as it's non-reversible." [puppet] - 10https://gerrit.wikimedia.org/r/361864 (owner: 10BBlack) [15:57:37] 10Operations, 10Labs, 10Labs-Infrastructure, 10procurement: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3387344 (10RobH) a:05RobH>03Papaul [15:58:30] kart_: yup, deploy logs look good [15:59:24] mobrovac: since it is backward compatible, I guess, logs won't add anything. [16:00:27] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:00:43] (03CR) 10Andrew Bogott: [C: 032] Add labtestpuppetmaster2001 hiera host defs [puppet] - 10https://gerrit.wikimedia.org/r/361704 (owner: 10Andrew Bogott) [16:01:01] (03PS9) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [16:01:16] kart_: i get the same response when i run a local instance of cxserver [16:03:27] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [16:03:33] ok, the registry file exists on the nodes in production, so it's not a config problem kart_ [16:05:29] mobrovac: yeah. Looks like something broken in code itself. 
[16:14:07] (03PS10) 10Ema: Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) [16:18:07] (03CR) 10Volans: [C: 04-1] "Good job! I think we're almost there, although I have some concerns on the UUID selection, see all of them inline." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [16:21:40] kart_: /v1/list/languagepairs returns an empty response as well [16:23:23] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3387471 (10Cmjohnson) [16:23:33] volans: thanks for the review, all good comments [16:23:43] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (10Cmjohnson) @robh network ports updated and vlans set [16:23:47] PROBLEM - Check whether ferm is active by checking the default input chain on mw2148 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:24:47] RECOVERY - Check whether ferm is active by checking the default input chain on mw2148 is OK: OK ferm input default policy is set [16:24:48] mobrovac: yes. just checked all points. [16:25:15] elukey: yw :) i skimmed very quickly the tests tbh ;) [16:29:59] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3387522 (10RobH) a:05Cmjohnson>03RobH [16:31:12] 10Operations, 10ops-eqiad, 10Analytics-Kanban: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3387528 (10Cmjohnson) 05Open>03Resolved New board added, updated idrac license...back online. 
[16:32:28] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3387531 (10Cmjohnson) @robh if you have the time to get these going that would be great [16:33:19] 10Operations, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3387532 (10Cmjohnson) Removing ops-eqiad tag... since on-site work is no longer needed [16:34:09] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3387554 (10RobH) a:05Cmjohnson>03RobH [16:40:02] (03PS1) 10RobH: setting production dns for labcontrol100[34] [dns] - 10https://gerrit.wikimedia.org/r/361900 [16:43:11] (03CR) 10RobH: [C: 032] setting production dns for labcontrol100[34] [dns] - 10https://gerrit.wikimedia.org/r/361900 (owner: 10RobH) [16:45:39] 10Operations, 10Labs, 10Labs-Infrastructure, 10procurement: rack/setup/install labtestcontrol2003.wikimedia.org - https://phabricator.wikimedia.org/T168894#3387588 (10Papaul) [16:46:05] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2003.wikimedia.org - https://phabricator.wikimedia.org/T168893#3387589 (10Papaul) [16:46:29] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestservices2002.wikimedia.org - https://phabricator.wikimedia.org/T168892#3387590 (10Papaul) [16:48:02] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestmetal2001.codfw.wmnet - https://phabricator.wikimedia.org/T168891#3387596 (10Papaul) [16:48:14] (03PS3) 10BBlack: VCL: remove disableImages handling [puppet] - 10https://gerrit.wikimedia.org/r/359417 (https://phabricator.wikimedia.org/T168013) (owner: 10Ema) [16:48:40] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestmetal2001.codfw.wmnet - 
https://phabricator.wikimedia.org/T168891#3379777 (10Papaul) @chasemp ok will do. [16:50:45] (03PS3) 10Filippo Giunchedi: role: cleanup CI data in graphite [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) [16:50:56] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3387617 (10RobH) irc update: these will need to be installed with jessie [16:51:52] (03CR) 10BBlack: "ema was saying +1 month from deploy date (I'm not sure what the driver for the month-long timeline was, something about the cookie expirat" [puppet] - 10https://gerrit.wikimedia.org/r/359417 (https://phabricator.wikimedia.org/T168013) (owner: 10Ema) [16:53:48] (03CR) 10Filippo Giunchedi: [C: 032] role: cleanup CI data in graphite [puppet] - 10https://gerrit.wikimedia.org/r/358910 (https://phabricator.wikimedia.org/T1075) (owner: 10Filippo Giunchedi) [16:53:50] (03CR) 10BBlack: [C: 031] "(and perhaps we should look at the other SPF lines in our repo, too)" [dns] - 10https://gerrit.wikimedia.org/r/358132 (https://phabricator.wikimedia.org/T133191) (owner: 10Herron) [16:54:31] andrewbogott: merging your change too [16:54:38] godog: thanks [16:57:40] (03CR) 10BBlack: [C: 032] add numactl to base packages [puppet] - 10https://gerrit.wikimedia.org/r/359488 (owner: 10BBlack) [16:57:47] (03PS2) 10BBlack: add numactl to base packages [puppet] - 10https://gerrit.wikimedia.org/r/359488 [16:57:50] (03CR) 10BBlack: [V: 032 C: 032] add numactl to base packages [puppet] - 10https://gerrit.wikimedia.org/r/359488 (owner: 10BBlack) [16:58:57] (03PS1) 10RobH: setting install params for labcontrol100[34] [puppet] - 10https://gerrit.wikimedia.org/r/361904 [17:01:29] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3387651 (10RobH) I emailed out about fixing the disk detection issue a couple of days ago, still awaiting some kind 
of feedback on the rootdelay versus systemd solution. If no answer is provided shortly, I've a... [17:03:38] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199 [17:03:47] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.199 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:03:54] (03CR) 10RobH: [C: 032] setting install params for labcontrol100[34] [puppet] - 10https://gerrit.wikimedia.org/r/361904 (owner: 10RobH) [17:04:08] (03PS2) 10RobH: setting install params for labcontrol100[34] [puppet] - 10https://gerrit.wikimedia.org/r/361904 [17:04:27] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [17:04:37] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 [17:09:21] (03PS4) 10BBlack: tlsproxy: restrict whole daemon to relevant NUMA node(s) [puppet] - 10https://gerrit.wikimedia.org/r/359489 [17:12:09] (03PS2) 10Urbanecm: Enable autopatrol flag on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361695 (https://phabricator.wikimedia.org/T168981) [17:12:24] (03PS3) 10Urbanecm: Enable autopatrol flag on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361695 (https://phabricator.wikimedia.org/T168981) [17:16:13] (03CR) 10BBlack: [C: 032] tlsproxy: restrict whole daemon to relevant NUMA node(s) [puppet] - 10https://gerrit.wikimedia.org/r/359489 (owner: 10BBlack) [17:17:07] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:19:07] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:27:02] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3387744 (10fgiunchedi) 05Open>03Resolved Disk has been swapped by @Cmjohnson and is rebuilding! [17:28:36] (03PS1) 10BBlack: nginx-numa: fix template bug [puppet] - 10https://gerrit.wikimedia.org/r/361907 [17:31:06] (03CR) 10BBlack: [C: 032] nginx-numa: fix template bug [puppet] - 10https://gerrit.wikimedia.org/r/361907 (owner: 10BBlack) [17:35:44] 10Operations, 10Discovery-Analysis: Upgrade pandoc package to at least 1.12.3 - https://phabricator.wikimedia.org/T168683#3387781 (10mpopov) This is for stat100* nodes (that use https://github.com/wikimedia/puppet/tree/production/modules/statistics) @Gehel: we had a similar issue with R where the version in t... [17:37:57] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3387796 (10faidon) Let's see if stretch without the rootdelay works -- if not, feel free to add the rootdelay to unblock this and I'll try to take one of the machines out of rotation to investigate this further. [17:38:42] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3387797 (10RobH) Sounds good! I'll work on re-imaging these shortly with stretch. Will test without root delay first. 
[17:54:56] (03PS1) 10Aaron Schulz: Set $wgTrxProfilerLimits[PostSend] to avoid notices for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361914 [17:55:46] (03CR) 10Dzahn: [C: 031] "follow-up patch sounds better than just a single one to me" [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [17:57:19] (03PS2) 10Aaron Schulz: Set $wgTrxProfilerLimits[PostSend] to avoid notices for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361914 [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170628T1800). Please do the needful. [18:00:04] Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:11] Present [18:00:17] (03CR) 10Aaron Schulz: [C: 032] Set $wgTrxProfilerLimits[PostSend] to avoid notices for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361914 (owner: 10Aaron Schulz) [18:00:27] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [18:02:17] !log kartik@tin Started deploy [cxserver/deploy@894e3fe]: (no justification provided) [18:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:26] (03Merged) 10jenkins-bot: Set $wgTrxProfilerLimits[PostSend] to avoid notices for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361914 (owner: 10Aaron Schulz) [18:02:41] (03CR) 10jenkins-bot: Set $wgTrxProfilerLimits[PostSend] to avoid notices for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361914 (owner: 10Aaron Schulz) [18:03:27] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [18:04:20] !log kartik@tin Finished deploy 
[cxserver/deploy@894e3fe]: (no justification provided) (duration: 02m 03s) [18:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:58] (03PS1) 10Alexandros Kosiaris: Update install1002's IP address [puppet] - 10https://gerrit.wikimedia.org/r/361915 [18:05:21] !log aaron@tin Synchronized wmf-config/CommonSettings.php: Set $wgTrxProfilerLimits[PostSend] to avoid notices for now (duration: 00m 47s) [18:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:51] Urbanecm: I can SWAT. [18:05:56] Great! [18:06:09] (03PS4) 10Thcipriani: Enable autopatrol flag on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361695 (https://phabricator.wikimedia.org/T168981) (owner: 10Urbanecm) [18:06:17] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361695 (https://phabricator.wikimedia.org/T168981) (owner: 10Urbanecm) [18:06:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update install1002's IP address [puppet] - 10https://gerrit.wikimedia.org/r/361915 (owner: 10Alexandros Kosiaris) [18:07:43] (03Merged) 10jenkins-bot: Enable autopatrol flag on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361695 (https://phabricator.wikimedia.org/T168981) (owner: 10Urbanecm) [18:07:58] (03CR) 10jenkins-bot: Enable autopatrol flag on ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361695 (https://phabricator.wikimedia.org/T168981) (owner: 10Urbanecm) [18:09:38] Urbanecm: live on mwdebug1002, check please [18:11:42] thcipriani: Working, deploy to the universe please! 
[18:11:49] * thcipriani does [18:13:34] (03PS2) 10Andrew Bogott: Restore labnet-users access to nova hosts [puppet] - 10https://gerrit.wikimedia.org/r/361786 (https://phabricator.wikimedia.org/T169018) (owner: 10Hashar) [18:13:38] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:361695|Enable autopatrol flag on ptwikivoyage]] T168981 (duration: 00m 47s) [18:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:47] T168981: Create a autopatrol flag on ptwikivoyage - https://phabricator.wikimedia.org/T168981 [18:13:53] ^ Urbanecm live now [18:14:02] Great! [18:14:26] Working. Thank you! [18:14:27] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:14:27] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:14:37] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:14:47] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:14:48] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:14:48] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:14:48] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:14:48] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:14:57] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:07] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:07] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:15:07] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:15:07] PROBLEM 
- IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:15:17] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6 [18:15:17] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:17] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:17] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:17] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:17] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:17] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:18] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6 [18:15:35] (03CR) 10Jdlrobson: [C: 031] "The cookies will still exist for the next month but will not impact the mobile view in any way. 
Even when cookie enabled the HTML should b" [puppet] - 10https://gerrit.wikimedia.org/r/359417 (https://phabricator.wikimedia.org/T168013) (owner: 10Ema) [18:15:38] chill icinga-wm, it's just that one host at the end, 4021 [18:18:26] (which i cant get on mgmt and is a temp entry for testing new cache node hardware setup, per comment in site.pp) [18:19:19] 10Operations, 10Wikibase-DataModel, 10Wikidata, 10Patch-For-Review, 10Wikidata-Sprint: Remove left-over alias for wikidata.org/ontology (doesn't work) - https://phabricator.wikimedia.org/T169023#3387908 (10thiemowmde) p:05Triage>03Lowest [18:23:17] (03CR) 10Andrew Bogott: [C: 032] Restore labnet-users access to nova hosts [puppet] - 10https://gerrit.wikimedia.org/r/361786 (https://phabricator.wikimedia.org/T169018) (owner: 10Hashar) [18:24:33] (03CR) 10Thiemo Mättig (WMDE): [C: 04-1] "The correct URL that actually resolves is http://wikiba.se/ontology" [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [18:24:43] i selected the icinga alerts above in web ui, then "Acknowledge checked service problems" but unchecked the "send notification" box before hitting submit. 
that way the Web UI is cleaned up but it doesn't cause more ACK lines here on IRC [18:25:47] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [18:25:47] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [18:25:47] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [18:25:48] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 58 ESP OK [18:25:48] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 72 ESP OK [18:25:57] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 72 ESP OK [18:26:07] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 72 ESP OK [18:26:07] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [18:26:07] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [18:26:07] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [18:26:17] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [18:26:21] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 72 ESP OK [18:26:21] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 72 ESP OK [18:26:21] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 72 ESP OK [18:26:21] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 72 ESP OK [18:26:21] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 72 ESP OK [18:26:21] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 72 ESP OK [18:26:21] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 72 ESP OK [18:26:21] !log joal@tin Started deploy [analytics/refinery@f6cccf9]: Regular deploy - One week late- Big changes [18:26:27] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [18:26:27] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [18:26:27] andrewbogott: labnet1001 - "nova instance creation test" says something is not running. python .. nova-fullstack. known? 
[18:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:37] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [18:26:56] mutante: Not super important but I will look [18:27:08] andrewbogott: 'k, cool [18:27:23] 10Operations, 10Wikibase-DataModel, 10Wikidata, 10Patch-For-Review, 10Wikidata-Sprint: Remove left-over alias for wikidata.org/ontology (doesn't work) - https://phabricator.wikimedia.org/T169023#3384741 (10thiemowmde) Please redirect to http://wikiba.se/ontology [18:31:06] Y [18:31:09] !log joal@tin Finished deploy [analytics/refinery@f6cccf9]: Regular deploy - One week late- Big changes (duration: 04m 49s) [18:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:27] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [18:33:58] (03PS1) 10RobH: fixing the partman use for labcontrol100[34] [puppet] - 10https://gerrit.wikimedia.org/r/361922 [18:34:31] (03CR) 10RobH: [C: 032] fixing the partman use for labcontrol100[34] [puppet] - 10https://gerrit.wikimedia.org/r/361922 (owner: 10RobH) [18:36:24] (03PS1) 10Andrew Bogott: Nova fullstack: Increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/361924 (https://phabricator.wikimedia.org/T165555) [18:36:43] (03CR) 10Krinkle: "Yes, so why -1 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [18:37:16] (03PS3) 10Dzahn: parsoid/visualdiff: Use packaged uprightdiff in testreduce and visualdiff [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [18:38:56] (03PS4) 10Dzahn: parsoid/visualdiff: Use packaged uprightdiff in testreduce and visualdiff [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [18:39:19] (03PS2) 10Andrew Bogott: Nova fullstack: Increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/361924 (https://phabricator.wikimedia.org/T165555) [18:40:33] (03CR) 10Dzahn: [C: 032] "only affects ruthenium, the test server, not wtp* (http://puppet-compiler.wmflabs.org/6883/)" [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [18:45:12] (03PS3) 10Andrew Bogott: Nova fullstack: Increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/361924 (https://phabricator.wikimedia.org/T165555) [18:45:20] (03CR) 10Ema: Add firewall rules for pinkunicorn (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) (owner: 10Ema) [18:47:25] (03CR) 10Dzahn: "/usr/local/bin/uprightdiff and /usr/bin/uprightdiff both exists on ruthenium and puppet change has now been applied" [puppet] - 10https://gerrit.wikimedia.org/r/327028 (owner: 10Legoktm) [18:48:56] (03PS2) 10Dzahn: icinga: final remove of check_ram.sh remnant [puppet] - 10https://gerrit.wikimedia.org/r/361799 [18:50:29] (03PS3) 10Dzahn: icinga: final remove of check_ram.sh remnant [puppet] - 10https://gerrit.wikimedia.org/r/361799 [18:51:36] (03CR) 10Andrew Bogott: [C: 032] Nova fullstack: Increase timeouts even more [puppet] - 10https://gerrit.wikimedia.org/r/361924 (https://phabricator.wikimedia.org/T165555) (owner: 10Andrew Bogott) [18:53:59] (03CR) 10Dzahn: [C: 032] "checked with cumin, looks all gone" [puppet] - 10https://gerrit.wikimedia.org/r/361799 (owner: 
10Dzahn) [18:54:09] (03PS4) 10Dzahn: icinga: final remove of check_ram.sh remnant [puppet] - 10https://gerrit.wikimedia.org/r/361799 [19:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170628T1900). [19:00:16] !log starting load testing of elasticsearch in codfw [19:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:48] (03CR) 10Framawiki: [C: 031] "Looks good for me !" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [19:02:58] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3388067 (10RobH) So I have these setup and loading into the installer, but grub fails: > Jun 28 18:53:20 grub-installer: info: Installing grub... [19:03:11] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3388075 (10RobH) [19:14:53] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.7 refs T167536 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361934 [19:14:55] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.7 refs T167536 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361934 (owner: 1020after4) [19:17:37] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.7 refs T167536 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361934 (owner: 1020after4) [19:17:45] !log cherry-picked https://gerrit.wikimedia.org/r/#/c/361935/ to wmf.7 refs T168899 + T167536 [19:17:46] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.7 refs T167536 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361934 (owner: 1020after4) [19:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:56] T168899: Fatal error: Class undefined: 
LoginNotifyPresentationModel in extensions/Echo/includes/formatters/EventPresentationModel.php on line 99 - https://phabricator.wikimedia.org/T168899 [19:17:56] T167536: MW-1.30.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T167536 [19:19:29] (03PS1) 10RobH: install stretch on wtp1025 through wtp1048 [puppet] - 10https://gerrit.wikimedia.org/r/361937 [19:20:47] (03CR) 10RobH: [C: 032] install stretch on wtp1025 through wtp1048 [puppet] - 10https://gerrit.wikimedia.org/r/361937 (owner: 10RobH) [19:24:32] 10Operations, 10MW-1.30-release-notes, 10Traffic, 10HTTPS, and 2 others: Enable HTTPS for swift clients - https://phabricator.wikimedia.org/T160616#3388198 (10aaron) Looks like SwiftRepl is the last element of this task. @fgiunchedi , how difficult does look to add to the replication script? I know you sa... [19:26:00] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.7/extensions/LoginNotify/includes/Hooks.php: deploy https://gerrit.wikimedia.org/r/#/c/361935/ to wmf.7 refs T168899 + T167536 (duration: 00m 45s) [19:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:10] T168899: Fatal error: Class undefined: LoginNotifyPresentationModel in extensions/Echo/includes/formatters/EventPresentationModel.php on line 99 - https://phabricator.wikimedia.org/T168899 [19:26:11] T167536: MW-1.30.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T167536 [19:26:42] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.7 refs T167536 [19:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:32] (03CR) 10BBlack: [C: 031] Add firewall rules for pinkunicorn [puppet] - 10https://gerrit.wikimedia.org/r/361844 (https://phabricator.wikimedia.org/T169039) (owner: 10Ema) [19:29:21] PROBLEM - MariaDB Slave SQL: s2 on db1047 is CRITICAL: CRITICAL slave_sql_state could not connect [19:29:21] PROBLEM - MariaDB Slave IO: s2 on 
db1047 is CRITICAL: CRITICAL slave_io_state could not connect [19:29:41] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag could not connect [19:29:41] PROBLEM - MariaDB Slave IO: s1 on db1047 is CRITICAL: CRITICAL slave_io_state could not connect [19:29:42] PROBLEM - MariaDB Slave SQL: s1 on db1047 is CRITICAL: CRITICAL slave_sql_state could not connect [19:31:31] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:31:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:34:31] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:35:40] (03CR) 10BearND: "Wouldn't this just ride the train once it gets merged?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [19:36:56] !log puppet suspended on install1002 for robh to livehack the dhcp file for a single reboot of wtp1025 [19:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:34] (03CR) 10Thiemo Mättig (WMDE): [C: 04-1] "Because this patch appears to remove instead of fix it." [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [19:38:24] are those mariadb errors potentially deployment related? [19:38:31] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:39:31] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:39:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. 
Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170628T2000). [20:00:30] Nothing for ORES today [20:00:31] no parsoid deploy today [20:08:13] !log migrating servermon to stretch on netmon1002 is currently blocked by "python-django-south" package not existing anymore [20:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:07] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.7/extensions/VisualEditor/VisualEditor.hooks.php: sync https://gerrit.wikimedia.org/r/#/c/361941/ refs T169132 T167536 (duration: 00m 47s) [20:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:19] T167536: MW-1.30.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T167536 [20:17:20] T169132: Argument 2 passed to VisualEditorHooks::onDiffViewHeader() must be an instance of Revision, null given - https://phabricator.wikimedia.org/T169132 [20:20:48] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3388413 (10GWicke) @anomie, this RFC is primarily about better addressing the... [20:33:03] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3388460 (10Dzahn) I tried to move on with the servermon role next, tested the role on stretch on a labs instance. Problem: ``` Package python-django-south is not available, but is referred to by anot... 
[20:37:01] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3388494 (10Dzahn) [20:37:36] 10Operations, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3077541 (10Dzahn) [20:46:31] PROBLEM - Host wtp1048 is DOWN: PING CRITICAL - Packet loss = 100% [20:46:31] PROBLEM - Host wtp1047 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:05] are these known issues or...? ^^ [20:51:41] RECOVERY - Host wtp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:51:41] RECOVERY - Host wtp1047 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:53:41] PROBLEM - MD RAID on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:53:41] PROBLEM - MD RAID on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:53:48] "Request from 198.73.209.4 via cp4016 cp4016, Varnish XID 115864684 [20:53:48] Error: 503, Backend fetch failed at Wed, 28 Jun 2017 20:53:25 GMT" [20:53:51] PROBLEM - dhclient process on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:53:51] PROBLEM - salt-minion processes on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:53:51] PROBLEM - DPKG on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:53:51] PROBLEM - configured eth on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:53:51] PROBLEM - configured eth on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:54:01] PROBLEM - SSH on wtp1047 is CRITICAL: connect to address 10.64.48.165 and port 22: Connection refused [20:54:01] PROBLEM - puppet last run on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:54:01] PROBLEM - puppet last run on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:54:02] PROBLEM - Check systemd state on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:54:02] PROBLEM - SSH on wtp1048 is CRITICAL: connect to address 10.64.48.166 and port 22: Connection refused [20:54:08] (just got 
this) [20:54:11] PROBLEM - Disk space on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:54:21] PROBLEM - dhclient process on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:54:21] PROBLEM - salt-minion processes on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:54:21] PROBLEM - Check systemd state on wtp1047 is CRITICAL: Return code of 255 is out of bounds [20:54:21] PROBLEM - DPKG on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:54:41] PROBLEM - Disk space on wtp1048 is CRITICAL: Return code of 255 is out of bounds [20:57:57] bleh [20:58:01] why are you alerting stupid systems [20:58:05] i never signed your puppet keys!??@ [20:58:19] fuck [20:58:38] ok so that's you then? gtk [20:59:13] abartov: do you have more of that or was it a one-off? [21:00:59] apergos: no, a re-submit saved the edit without problems. [21:01:03] ok [21:01:07] yeah all me [21:01:17] those alerts above are bogus so let's assume for now it was just a blip [21:01:22] i installed jessie on them all, but left without puppet signed cuz i was gonna reinstall them with stretch [21:01:33] ah makes sense [21:01:44] i think someone must have signed one thinking it was ok, or something, not sure [21:01:55] cuz 47 and 48 alert but rest have pending keys still [21:01:57] oh well [21:02:00] lol [21:02:16] (03PS1) 10Dzahn: librenms: add support for stretch, adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 [21:03:20] (03CR) 10jerkins-bot: [V: 04-1] librenms: add support for stretch, adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 (owner: 10Dzahn) [21:04:53] (03PS2) 10Dzahn: librenms: add support for stretch, adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 [21:06:02] (03CR) 10jerkins-bot: [V: 04-1] librenms: add support for stretch, adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 (owner: 10Dzahn) [21:06:45] (03PS3) 10Dzahn: librenms: add support for stretch, 
adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 [21:09:26] @seen hashar [21:09:26] mutante: Last time I saw hashar they were quitting the network with reason: Quit: Textual IRC Client: www.textualapp.com N/A at 6/28/2017 4:48:48 PM (4h20m38s ago) [21:11:01] RECOVERY - SSH on wtp1047 is OK: SSH OK - OpenSSH_7.4p1 Debian-10 (protocol 2.0) [21:11:11] RECOVERY - SSH on wtp1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10 (protocol 2.0) [21:12:21] (03PS4) 10Dzahn: librenms: add support for stretch, adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 [21:14:25] (03PS5) 10Dzahn: librenms: add support for stretch, adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 [21:14:41] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1048 is CRITICAL: Return code of 255 is out of bounds [21:16:31] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1047 is CRITICAL: Return code of 255 is out of bounds [21:19:58] (03CR) 10MaxSem: "I initially said 1 month due to HTTP caching, however if hashing makes sure no pages without images get served, this is good to go, from" [puppet] - 10https://gerrit.wikimedia.org/r/359417 (https://phabricator.wikimedia.org/T168013) (owner: 10Ema) [21:22:42] (03CR) 10Dzahn: "snmp-mibs-downloader is in non-free" [puppet] - 10https://gerrit.wikimedia.org/r/362014 (owner: 10Dzahn) [21:26:38] (03PS6) 10Dzahn: librenms: add support for stretch, adjust (PHP) packages [puppet] - 10https://gerrit.wikimedia.org/r/362014 [21:28:03] (03PS3) 10Chad: Use rsync::quickdatacopy for copying data between old-new release servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 [21:28:35] (03PS1) 10Chad: Kill OAI log channel, extension long since undeployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362088 [21:29:30] (03CR) 10jerkins-bot: [V: 04-1] Use rsync::quickdatacopy for copying data between old-new release servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 
(owner: 10Chad) [21:30:05] (03CR) 10Chad: [C: 032] Kill OAI log channel, extension long since undeployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362088 (owner: 10Chad) [21:31:18] (03Merged) 10jenkins-bot: Kill OAI log channel, extension long since undeployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362088 (owner: 10Chad) [21:31:30] (03CR) 10jenkins-bot: Kill OAI log channel, extension long since undeployed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362088 (owner: 10Chad) [21:32:37] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3388698 (10RobH) OS installed on all hosts, and it seems the disk spin up issue doesn't happen in stretch! I rebooted(cold/warm/etc) a few of the hosts a few times each and didn't reproduc... [21:33:31] PROBLEM - IPMI Temperature on wtp1048 is CRITICAL: Return code of 255 is out of bounds [21:34:34] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: kill oai logging channel (duration: 00m 47s) [21:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:21] PROBLEM - IPMI Temperature on wtp1047 is CRITICAL: Return code of 255 is out of bounds [21:37:26] (03PS4) 10Chad: Use rsync::quickdatacopy for copying data between old-new release servers [puppet] - 10https://gerrit.wikimedia.org/r/361812 [21:37:43] !log ppchelko@tin Started deploy [eventstreams/deploy@05bcc8f]: redeploy to pick up config changes [21:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:04] !log ppchelko@tin Finished deploy [eventstreams/deploy@05bcc8f]: redeploy to pick up config changes (duration: 00m 20s) [21:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:15] 10Operations, 10ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3388733 (10RobH) [21:42:39] (03PS1) 10Chad: Remove weird testwiki logging config 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/362090 [21:43:35] PROBLEM - MD RAID on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:43:35] PROBLEM - puppet last run on wtp1026 is CRITICAL: Return code of 255 is out of bounds [21:43:35] PROBLEM - salt-minion processes on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:43:35] PROBLEM - Check systemd state on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:43:35] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:43:35] PROBLEM - configured eth on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:43:35] PROBLEM - DPKG on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:43:45] PROBLEM - MD RAID on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:43:45] PROBLEM - Check systemd state on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:43:45] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:43:46] PROBLEM - configured eth on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:43:46] PROBLEM - DPKG on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:43:46] PROBLEM - dhclient process on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:43:46] PROBLEM - Disk space on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:44:03] (03CR) 10Chad: [C: 032] Remove weird testwiki logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362090 (owner: 10Chad) [21:44:05] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:44:05] PROBLEM - configured eth on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:44:05] PROBLEM - DPKG on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:44:05] PROBLEM - dhclient process on wtp1040 is CRITICAL: Return code of 255 
is out of bounds [21:44:05] PROBLEM - puppet last run on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:44:05] PROBLEM - Disk space on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:44:25] PROBLEM - configured eth on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:44:25] PROBLEM - DPKG on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:44:25] PROBLEM - dhclient process on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:44:25] PROBLEM - Disk space on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:44:25] PROBLEM - puppet last run on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:44:25] PROBLEM - salt-minion processes on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:44:25] PROBLEM - MD RAID on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:44:35] RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:44:45] PROBLEM - dhclient process on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:44:45] PROBLEM - Disk space on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:44:45] PROBLEM - puppet last run on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:44:45] PROBLEM - MD RAID on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:44:45] PROBLEM - Check systemd state on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:44:45] PROBLEM - salt-minion processes on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:44:55] PROBLEM - puppet last run on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:44:55] PROBLEM - salt-minion processes on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:44:55] PROBLEM - MD RAID on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:44:55] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1045 is CRITICAL: Return code of 255 is out of bounds 
[21:44:56] PROBLEM - Check systemd state on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:44:56] PROBLEM - configured eth on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:45:04] (03Merged) 10jenkins-bot: Remove weird testwiki logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362090 (owner: 10Chad) [21:45:17] PROBLEM - salt-minion processes on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:45:17] PROBLEM - MD RAID on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:45:17] PROBLEM - Check systemd state on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:45:18] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:45:18] PROBLEM - configured eth on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:45:18] PROBLEM - DPKG on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:45:18] PROBLEM - dhclient process on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:45:35] PROBLEM - Check systemd state on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:45:35] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:45:35] PROBLEM - DPKG on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:45:35] PROBLEM - configured eth on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:45:35] PROBLEM - dhclient process on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:45:35] PROBLEM - Disk space on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:45:35] PROBLEM - puppet last run on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:45:51] (03CR) 10jenkins-bot: Remove weird testwiki logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362090 (owner: 10Chad) [21:45:55] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1033 is CRITICAL: 
Return code of 255 is out of bounds [21:45:55] PROBLEM - DPKG on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:45:55] PROBLEM - configured eth on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:45:55] PROBLEM - dhclient process on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:45:55] PROBLEM - Disk space on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:45:56] PROBLEM - puppet last run on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:45:56] PROBLEM - salt-minion processes on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:46:05] PROBLEM - DPKG on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:46:05] PROBLEM - dhclient process on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:46:05] PROBLEM - salt-minion processes on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:46:05] PROBLEM - puppet last run on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:46:05] PROBLEM - MD RAID on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:46:06] PROBLEM - Disk space on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:46:25] PROBLEM - puppet last run on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:46:25] PROBLEM - Disk space on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:46:25] PROBLEM - salt-minion processes on wtp1039 is CRITICAL: Return code of 255 is out of bounds [21:46:25] PROBLEM - MD RAID on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:46:25] PROBLEM - Check systemd state on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:46:45] PROBLEM - salt-minion processes on wtp1034 is CRITICAL: Return code of 255 is out of bounds [21:46:45] PROBLEM - MD RAID on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:46:45] PROBLEM - Check systemd state on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:46:45] PROBLEM - Check the NTP synchronisation 
status of timesyncd on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:46:45] PROBLEM - configured eth on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:46:55] PROBLEM - MD RAID on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:46:56] PROBLEM - Check systemd state on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:46:56] PROBLEM - configured eth on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:46:56] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:46:56] PROBLEM - dhclient process on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:46:56] PROBLEM - DPKG on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:47:15] PROBLEM - Check systemd state on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:47:15] PROBLEM - configured eth on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:47:15] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:47:15] PROBLEM - DPKG on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:47:15] PROBLEM - dhclient process on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:47:15] PROBLEM - Disk space on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:47:15] PROBLEM - puppet last run on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:47:35] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:47:35] PROBLEM - configured eth on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:47:35] PROBLEM - dhclient process on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:47:35] PROBLEM - Disk space on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:47:35] PROBLEM - DPKG on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:47:35] PROBLEM 
- puppet last run on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:47:35] PROBLEM - salt-minion processes on wtp1045 is CRITICAL: Return code of 255 is out of bounds [21:47:55] PROBLEM - DPKG on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:47:55] PROBLEM - dhclient process on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:47:55] PROBLEM - Disk space on wtp1035 is CRITICAL: Return code of 255 is out of bounds [21:47:55] PROBLEM - puppet last run on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:47:55] PROBLEM - salt-minion processes on wtp1042 is CRITICAL: Return code of 255 is out of bounds [21:47:55] PROBLEM - MD RAID on wtp1044 is CRITICAL: Return code of 255 is out of bounds [21:47:55] PROBLEM - Check systemd state on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:48:05] PROBLEM - Disk space on wtp1029 is CRITICAL: Return code of 255 is out of bounds [21:48:05] PROBLEM - puppet last run on wtp1033 is CRITICAL: Return code of 255 is out of bounds [21:48:05] PROBLEM - MD RAID on wtp1040 is CRITICAL: Return code of 255 is out of bounds [21:48:06] PROBLEM - salt-minion processes on wtp1037 is CRITICAL: Return code of 255 is out of bounds [21:48:06] PROBLEM - Check systemd state on wtp1043 is CRITICAL: Return code of 255 is out of bounds [21:48:06] PROBLEM - Check the NTP synchronisation status of timesyncd on wtp1046 is CRITICAL: Return code of 255 is out of bounds [21:48:22] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: kill weird testwiki logging (duration: 00m 47s) [21:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:37] whyyyyy [21:49:56] THIS IS OK [21:50:00] they are new wtp systems [21:50:29] !log wtp1025-1048 are having icinga reporting errors, they are new installs on stretch [21:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:46] (03PS1) 10EBernhardson: Monitor elasticsearch stats for load 
test [puppet] - https://gerrit.wikimedia.org/r/362091 (https://phabricator.wikimedia.org/T169002) [21:56:33] (PS2) Andrew Bogott: nslcd: Remove Labs shell override [puppet] - https://gerrit.wikimedia.org/r/361595 (https://phabricator.wikimedia.org/T86668) (owner: BryanDavis) [21:59:27] they'll flap a bit more as i troubleshoot each one. i dont want to put in maint mode, since i need to see the flap to clear =] [21:59:34] (CR) Andrew Bogott: [C: 2] nslcd: Remove Labs shell override [puppet] - https://gerrit.wikimedia.org/r/361595 (https://phabricator.wikimedia.org/T86668) (owner: BryanDavis) [22:01:39] (PS1) Chad: Drop temp-debug: use AdHocDebug instead [mediawiki-config] - https://gerrit.wikimedia.org/r/362093 [22:02:01] Operations, ops-eqiad: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3388958 (RobH) wtp1025-wtp1028 good to go, puppet/salt signed and os installed. [22:02:05] PROBLEM - IPMI Temperature on wtp1046 is CRITICAL: Return code of 255 is out of bounds [22:02:25] PROBLEM - IPMI Temperature on wtp1043 is CRITICAL: Return code of 255 is out of bounds [22:02:35] PROBLEM - IPMI Temperature on wtp1039 is CRITICAL: Return code of 255 is out of bounds [22:02:51] (PS1) Chad: Revert "Enable T143073 debug log channel" [mediawiki-config] - https://gerrit.wikimedia.org/r/362094 [22:02:55] PROBLEM - IPMI Temperature on wtp1034 is CRITICAL: Return code of 255 is out of bounds [22:03:33] (CR) Chad: [C: 2] Drop temp-debug: use AdHocDebug instead [mediawiki-config] - https://gerrit.wikimedia.org/r/362093 (owner: Chad) [22:03:55] PROBLEM - IPMI Temperature on wtp1045 is CRITICAL: Return code of 255 is out of bounds [22:04:05] PROBLEM - IPMI Temperature on wtp1042 is CRITICAL: Return code of 255 is out of bounds [22:04:05] PROBLEM - SSH on wtp1029 is CRITICAL: connect to address 10.64.0.243 and port 22: Connection refused [22:04:25] PROBLEM - IPMI Temperature on wtp1037 is CRITICAL: Return code of 
255 is out of bounds [22:04:33] (Merged) jenkins-bot: Drop temp-debug: use AdHocDebug instead [mediawiki-config] - https://gerrit.wikimedia.org/r/362093 (owner: Chad) [22:04:42] (PS3) Andrew Bogott: Reapply "labtest hiera: use labtestwikitech, not wikitech" [puppet] - https://gerrit.wikimedia.org/r/331636 (https://phabricator.wikimedia.org/T145808) (owner: Alex Monk) [22:04:45] PROBLEM - IPMI Temperature on wtp1033 is CRITICAL: Return code of 255 is out of bounds [22:04:56] (Abandoned) Chad: Revert "Enable T143073 debug log channel" [mediawiki-config] - https://gerrit.wikimedia.org/r/362094 (owner: Chad) [22:05:35] PROBLEM - IPMI Temperature on wtp1044 is CRITICAL: Return code of 255 is out of bounds [22:05:49] (CR) jenkins-bot: Drop temp-debug: use AdHocDebug instead [mediawiki-config] - https://gerrit.wikimedia.org/r/362093 (owner: Chad) [22:05:55] PROBLEM - IPMI Temperature on wtp1040 is CRITICAL: Return code of 255 is out of bounds [22:06:05] PROBLEM - IPMI Temperature on wtp1035 is CRITICAL: Return code of 255 is out of bounds [22:06:16] (CR) Andrew Bogott: [C: 2] Reapply "labtest hiera: use labtestwikitech, not wikitech" [puppet] - https://gerrit.wikimedia.org/r/331636 (https://phabricator.wikimedia.org/T145808) (owner: Alex Monk) [22:06:25] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: kill temp-debug (duration: 00m 46s) [22:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:05] !log ppchelko@tin Started deploy [eventstreams/deploy@ba71a84]: redeploy to pick up config changes [22:07:14] (PS1) Chad: Drop 2 task-based log channels [mediawiki-config] - https://gerrit.wikimedia.org/r/362095 [22:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:22] (CR) Chad: [C: 2] Drop 2 task-based log channels [mediawiki-config] - https://gerrit.wikimedia.org/r/362095 (owner: Chad) [22:08:29] (Merged) jenkins-bot: Drop 2 
task-based log channels [mediawiki-config] - https://gerrit.wikimedia.org/r/362095 (owner: Chad) [22:08:37] (CR) jenkins-bot: Drop 2 task-based log channels [mediawiki-config] - https://gerrit.wikimedia.org/r/362095 (owner: Chad) [22:09:06] !log ppchelko@tin Finished deploy [eventstreams/deploy@ba71a84]: redeploy to pick up config changes (duration: 02m 01s) [22:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:10] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: rm more stupid logging, wow this stuff has piled up (duration: 00m 46s) [22:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:49] (PS1) Awight: Tweak config file user. [puppet] - https://gerrit.wikimedia.org/r/362097 [22:18:04] (CR) Chad: Tweak config file user. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/362097 (owner: Awight) [22:19:03] (CR) Awight: [C: -1] "Holding off until I understand what "deploy-service" is all about. @Chad good point, ty!" [puppet] - https://gerrit.wikimedia.org/r/362097 (owner: Awight) [22:19:34] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [22:23:36] (PS2) Awight: Tweak ores::web config file user. [puppet] - https://gerrit.wikimedia.org/r/362097 [22:23:49] (CR) Awight: Tweak ores::web config file user. [puppet] - https://gerrit.wikimedia.org/r/362097 (owner: Awight) [22:27:04] PROBLEM - Host wtp1033 is DOWN: PING CRITICAL - Packet loss = 100% [22:27:05] Operations, Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3389037 (DFoy) My only reference was the conversion of all the zero partners (around 75) from URL-based whitelisting to IP whitelisting when we adopted HTTPS-only, and that took about 5-6 mo... 
[22:29:14] PROBLEM - Host wtp1034 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:14] RECOVERY - Host wtp1033 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [22:33:04] PROBLEM - Host wtp1035 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:14] PROBLEM - SSH on wtp1033 is CRITICAL: connect to address 10.64.16.88 and port 22: Connection refused [22:34:24] RECOVERY - Host wtp1034 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [22:36:34] PROBLEM - SSH on wtp1034 is CRITICAL: connect to address 10.64.16.89 and port 22: Connection refused [22:38:14] RECOVERY - Host wtp1035 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [22:40:24] RECOVERY - SSH on wtp1033 is OK: SSH OK - OpenSSH_7.4p1 Debian-10 (protocol 2.0) [22:40:41] (PS2) EBernhardson: Monitor elasticsearch stats for load test [puppet] - https://gerrit.wikimedia.org/r/362091 (https://phabricator.wikimedia.org/T169002) [22:41:22] (CR) EBernhardson: "@gehel I'm going to need this for the LTR load test, to be able to differentiate between overall behavior of search and the behavior of th" [puppet] - https://gerrit.wikimedia.org/r/362091 (https://phabricator.wikimedia.org/T169002) (owner: EBernhardson) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170628T2300). [23:00:04] Jdlrobson and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:01:14] \o [23:01:33] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: mwdeploy user has shell /bin/bash in labs LDAP and /bin/false in production/Puppet - https://phabricator.wikimedia.org/T67591#3389104 (bd808) [23:01:37] Operations, LDAP-Access-Requests, Labs, Labs-Infrastructure, and 2 others: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3389101 (bd808) Open>Resolved a:bd808 [23:03:52] \o [23:08:28] I can SWAT. [23:19:38] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [23:22:23] thcipriani: awesome! [23:22:37] hah was about to ask if my irc had broken :) [23:23:08] heh, just waiting on jenkins... [23:25:58] PROBLEM - Host wtp1037 is DOWN: PING CRITICAL - Packet loss = 100% [23:29:58] (PS1) Mforns: [WIP] Fix timestamp infinite loop in EL purging script [puppet] - https://gerrit.wikimedia.org/r/362101 [23:30:54] (CR) jenkins-bot: [V: -1] [WIP] Fix timestamp infinite loop in EL purging script [puppet] - https://gerrit.wikimedia.org/r/362101 (owner: Mforns) [23:31:08] RECOVERY - Host wtp1037 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [23:31:53] jdlrobson: ok, MobileFrontend updates live on mwdebug1002, check please [23:31:59] testing :) [23:32:15] (both wmf.6 and wmf.7) [23:32:47] thcipriani: that's done the trick! 
thank you [23:32:49] sync away [23:32:56] ok, wmf.7 first, then wmf.6 [23:34:08] PROBLEM - SSH on wtp1037 is CRITICAL: connect to address 10.64.32.229 and port 22: Connection refused [23:35:02] !log thcipriani@tin Synchronized php-1.30.0-wmf.7/extensions/MobileFrontend/includes/specials/SpecialMobileDiff.php: SWAT: [[gerrit:361945|Revert "Run DiffViewHeader in mobile mode, too"]] T169024 (duration: 00m 47s) [23:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:14] T169024: Inactive revision slider button visible on diff view - https://phabricator.wikimedia.org/T169024 [23:36:29] (PS1) Mforns: [WIP] Fix timestamp infinite loop in EL purging script (1) [puppet] - https://gerrit.wikimedia.org/r/362103 [23:36:32] !log thcipriani@tin Synchronized php-1.30.0-wmf.6/extensions/MobileFrontend/includes/specials/SpecialMobileDiff.php: SWAT: [[gerrit:361944|Revert "Run DiffViewHeader in mobile mode, too"]] T169024 (duration: 00m 46s) [23:36:37] ^ jdlrobson all live [23:36:42] w00t [23:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:45] thanks thcipriani [23:37:30] ebernhardson: WikimediaEvents change live on mwdebug1002, check please [23:37:36] (PS2) Mforns: [WIP] Fix timestamp infinite loop in EL purging script (2) [puppet] - https://gerrit.wikimedia.org/r/362101 [23:38:36] (CR) jenkins-bot: [V: -1] [WIP] Fix timestamp infinite loop in EL purging script (2) [puppet] - https://gerrit.wikimedia.org/r/362101 (owner: Mforns) [23:42:15] thcipriani: looks all sane [23:42:24] ebernhardson: okie doke, going live [23:44:38] !log thcipriani@tin Synchronized php-1.30.0-wmf.7/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:361950|Adding ssclick events for sister-search results]] T168916 (duration: 00m 46s) [23:44:44] ^ ebernhardson live everywhere [23:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:50] 
T168916: Sister search result clicks missing from search event logging - https://phabricator.wikimedia.org/T168916 [23:49:08] thcipriani: thanks! [23:52:25] (PS2) Thcipriani: Scap: scap_source correct gid [puppet] - https://gerrit.wikimedia.org/r/361796 [23:56:26] (PS6) Krinkle: Enable wgUsejQueryThree on the Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/348120 (https://phabricator.wikimedia.org/T124742) [23:58:39] (PS3) Thcipriani: Scap: scap_source correct gid [puppet] - https://gerrit.wikimedia.org/r/361796