[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T0000). Please do the needful. [00:00:50] jerkins-bot are you serious, that is a -1 for another change, just that it's a dependency, ok [00:05:51] (03PS2) 10Dzahn: base: blacklist acpi_pad kernel module [puppet] - 10https://gerrit.wikimedia.org/r/348016 [00:12:33] RECOVERY - Recursive DNS on 208.80.153.42 is OK: DNS OK: 0.342 seconds response time. www.wikipedia.org returns 208.80.154.224 [00:13:14] 06Operations, 13Patch-For-Review: Reimage achernar and acamar to jessie - https://phabricator.wikimedia.org/T155411#3177164 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['achernar.wikimedia.org'] ``` and were **ALL** successful. [00:14:12] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#3177166 (10Dzahn) >>! In T152024#2835945, @Krenair wrote: > "gridmaster host resolution" is a tools project specific thing, why is... [00:17:16] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#3177189 (10Dzahn) Ah, there is a section in hieradata/regex.yaml for labtest-specific settings. That is also what disables paging.... 
[00:18:43] (03PS1) 10Dzahn: labtest: avoid broken Icinga checks on labtest [puppet] - 10https://gerrit.wikimedia.org/r/348022 (https://phabricator.wikimedia.org/T152024) [00:18:52] (03PS1) 10BBlack: Revert "LVS: remove direct use of achernar recdns" [puppet] - 10https://gerrit.wikimedia.org/r/348023 [00:18:58] (03CR) 10BBlack: [V: 032 C: 032] Revert "LVS: remove direct use of achernar recdns" [puppet] - 10https://gerrit.wikimedia.org/r/348023 (owner: 10BBlack) [00:19:56] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:20:43] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 5.435 second response time [00:22:57] (03CR) 10jerkins-bot: [V: 04-1] labtest: avoid broken Icinga checks on labtest [puppet] - 10https://gerrit.wikimedia.org/r/348022 (https://phabricator.wikimedia.org/T152024) (owner: 10Dzahn) [00:36:03] !log catrope@tin Finished scap: Split RCFilters GuidedTour messages for ORES vs non-ORES (T162693) (duration: 53m 47s) [00:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:13] T162693: Guided tour for the New Filters mentions ORES predictions on wikis where they are not available - https://phabricator.wikimedia.org/T162693 [01:58:21] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=achernar.wikimedia.org,dc=codfw,cluster=dns,service=pdns_recursor [01:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:12] 06Operations, 13Patch-For-Review: Reimage achernar and acamar to jessie - https://phabricator.wikimedia.org/T155411#3177281 (10BBlack) 05Open>03Resolved a:03BBlack [02:03:17] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 603 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3005725 keys, up 20 days 9 hours - replication_delay is 603 [02:11:18] RECOVERY - Check health of redis instance on 6479 on rdb2005 
is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2987024 keys, up 20 days 9 hours - replication_delay is 23 [02:13:55] 06Operations, 10DNS, 10Traffic, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3177301 (10GWicke) Lets perhaps tackle T123854, so that icinga also keeps an eye on the action API? [02:20:27] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17457304 [02:21:17] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2987024 keys, up 20 days 10 hours - replication_delay is 623 [02:21:27] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 232 [02:23:10] 06Operations, 10MediaWiki-API, 10Monitoring, 06Services, 10Traffic: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#3177305 (10GWicke) Using grafana's new & spiffy alert feature, I set up a simple alert for the RESTBase backend request latency using the... [03:20:57] PROBLEM - NTP peers on achernar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [03:38:27] PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:41:57] PROBLEM - very high load average likely xfs on ms-be2023 is CRITICAL: CRITICAL - load average: 170.83, 111.93, 58.50 [03:43:57] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[03:44:07] PROBLEM - Disk space on ms-be2023 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error [03:46:57] RECOVERY - very high load average likely xfs on ms-be2023 is OK: OK - load average: 60.00, 77.89, 58.61 [03:48:27] PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [03:48:57] PROBLEM - swift-container-replicator on ms-be2023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [03:49:17] PROBLEM - swift-object-auditor on ms-be2023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [03:49:17] PROBLEM - swift-object-replicator on ms-be2023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [03:49:37] PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [03:52:57] PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:12:03] !log ms-be2023 icinga alerts, no more swift processes. cant ssh to it. attempt to power cycle. 
mgmt console enormous spam of "rejecting I/O to offline device" [04:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:37] PROBLEM - Host ms-be2023 is DOWN: PING CRITICAL - Packet loss = 100% [04:14:12] !log ms-be2023 is rebooting [04:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:27] RECOVERY - swift-container-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [04:14:37] RECOVERY - swift-account-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [04:14:37] RECOVERY - Host ms-be2023 is UP: PING OK - Packet loss = 0%, RTA = 37.87 ms [04:14:57] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational [04:14:57] RECOVERY - swift-object-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:14:57] RECOVERY - swift-container-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [04:15:07] RECOVERY - Disk space on ms-be2023 is OK: DISK OK [04:15:17] RECOVERY - swift-object-auditor on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [04:15:18] RECOVERY - swift-object-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [04:15:18] RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [04:19:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:21:17] PROBLEM - puppet last run on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:22:17] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures [04:22:52] 06Operations: ms-be2023 freeze - https://phabricator.wikimedia.org/T162854#3177353 (10Dzahn) [04:23:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [04:26:38] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:27:27] PROBLEM - puppet last run on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:27:37] PROBLEM - Check systemd state on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:28:17] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures [04:28:27] RECOVERY - Check systemd state on mira is OK: OK - running: The system is fully operational [04:29:27] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [04:30:47] PROBLEM - Check the NTP synchronisation status of timesyncd on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:31:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:47] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:47] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:35:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:35:47] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [04:38:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [04:38:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:39:15] (03PS1) 10Papaul: DNS:Add mgmt and production DNS for db20[7-9][0-9] [dns] - 10https://gerrit.wikimedia.org/r/348037 [04:39:17] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2987744 keys, up 20 days 12 hours - replication_delay is 21 [04:39:27] PROBLEM - puppet last run on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:39:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [04:42:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:43:27] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 54 minutes ago with 0 failures [04:43:34] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3177374 (10Papaul) [04:46:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:27] PROBLEM - puppet last run on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:27] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. 
[04:49:17] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 621 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2987744 keys, up 20 days 12 hours - replication_delay is 621 [04:49:27] RECOVERY - configured eth on mira is OK: OK - interfaces up [04:51:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [04:51:57] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [04:53:57] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [04:57:17] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2987851 keys, up 20 days 12 hours - replication_delay is 0 [04:58:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:58:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:59:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [05:02:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [05:04:27] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:27] RECOVERY - configured eth on mira is OK: OK - interfaces up [05:07:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:12:28] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [05:15:37] PROBLEM - MegaRAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:15:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:16:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. 
[05:20:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:24:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [05:24:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:26:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [05:29:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:30:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [05:31:27] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:28] RECOVERY - configured eth on mira is OK: OK - interfaces up [05:33:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:35:27] RECOVERY - MegaRAID on mira is OK: OK: no disks configured for RAID [05:36:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:37:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [05:40:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [05:46:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:47:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [05:47:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:47:57] PROBLEM - DPKG on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:48:47] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [05:48:47] RECOVERY - DPKG on mira is OK: All packages OK [05:50:27] PROBLEM - configured eth on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:52:27] RECOVERY - configured eth on mira is OK: OK - interfaces up [05:53:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [05:55:07] PROBLEM - Host mira is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2717.04 ms [05:55:37] RECOVERY - Host mira is UP: PING WARNING - Packet loss = 73%, RTA = 139.28 ms [05:56:47] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [06:03:07] PROBLEM - Host mira is DOWN: PING CRITICAL - Packet loss = 100% [06:04:07] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:04:37] RECOVERY - Host mira is UP: PING WARNING - Packet loss = 37%, RTA = 727.96 ms [06:07:37] PROBLEM - MD RAID on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:07:57] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:08:57] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [06:09:07] PROBLEM - Host mira is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2142.78 ms [06:09:17] mira is not feeling good :) [06:09:27] RECOVERY - Host mira is UP: PING OK - Packet loss = 0%, RTA = 346.71 ms [06:09:37] RECOVERY - MD RAID on mira is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [06:12:30] I can see in dmesg a lot of tg3 0000:02:00.0 eth0 etc.. [06:13:47] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:14:28] !log powercycle mira - eth0 errors in the dmesg, CPU system utilization skyrocketed [06:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:37] PROBLEM - Host mira is DOWN: PING CRITICAL - Packet loss = 100% [06:18:37] RECOVERY - Host mira is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [06:18:47] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [06:20:47] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Cannot connect to keyholder-proxy socket /run/keyholder/proxy.sock. [06:21:07] PROBLEM - nutcracker process on mira is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (nutcracker), command name nutcracker [06:21:17] PROBLEM - nutcracker port on mira is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [06:21:27] PROBLEM - Check systemd state on mira is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:23:07] RECOVERY - nutcracker process on mira is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [06:23:17] RECOVERY - nutcracker port on mira is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [06:24:27] RECOVERY - Check systemd state on mira is OK: OK - running: The system is fully operational [06:25:17] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:26:53] I don't see any errors anymore.. weird [06:26:57] PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0] [06:27:20] moritzm: o/ - was mira upgraded to 4.9 recently? 
[06:27:39] one week ago [06:27:47] weird [06:27:57] RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [06:28:09] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=mira&refresh=1m&orgId=1&from=now-3h&to=now looks really strange [06:28:36] system cpu increase all of a sudden and kernel errors for eth0 [06:29:04] !log re-arm keyholder on mira after reboot [06:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:34] elukey: yeah, some time last week. having a look [06:30:37] RECOVERY - Check the NTP synchronisation status of timesyncd on mira is OK: OK: synced at Thu 2017-04-13 06:30:35 UTC. [06:30:47] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [06:31:16] one pass to re-arm the keyholder [06:31:17] wow [06:31:23] nice! [06:33:07] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:33:49] the NIC on mira hung itself up starting at 5:45 and from there things went downhill [06:45:11] yep matches with the graphs.. 
[06:47:05] I'll open a ticket, we should swap the card [06:47:39] it's OOW, but maybe we have a spare around [06:55:02] 06Operations, 06Performance-Team: Some Core availability Catchpoint tests might be more expensive than they need to be - https://phabricator.wikimedia.org/T162857#3177488 (10Gilles) [07:01:26] 06Operations, 10ops-codfw: Swap NIC on mira - https://phabricator.wikimedia.org/T162859#3177514 (10MoritzMuehlenhoff) [07:08:03] (03PS2) 10Muehlenhoff: Allow silencing a debconf query [puppet] - 10https://gerrit.wikimedia.org/r/347378 [07:22:17] (03CR) 10Muehlenhoff: [C: 032] Allow silencing a debconf query [puppet] - 10https://gerrit.wikimedia.org/r/347378 (owner: 10Muehlenhoff) [07:42:12] (03PS1) 10Filippo Giunchedi: nagios: add team-performance contact [puppet] - 10https://gerrit.wikimedia.org/r/348041 (https://phabricator.wikimedia.org/T161703) [07:44:11] (03PS2) 10Filippo Giunchedi: nagios: add team-performance contact [puppet] - 10https://gerrit.wikimedia.org/r/348041 (https://phabricator.wikimedia.org/T161703) [07:46:37] (03CR) 10Filippo Giunchedi: [C: 032] nagios: add team-performance contact [puppet] - 10https://gerrit.wikimedia.org/r/348041 (https://phabricator.wikimedia.org/T161703) (owner: 10Filippo Giunchedi) [07:50:19] 06Operations, 06Performance-Team, 13Patch-For-Review: Add performance-team contact group to private.git - https://phabricator.wikimedia.org/T161703#3177559 (10fgiunchedi) 05Open>03Resolved Completed! emails to `performance-team` ML should be happening now. Note that for consistency with the rest the actu... 
[08:13:06] (03PS1) 10Muehlenhoff: Remove obsolete parameter for debconf::seen [puppet] - 10https://gerrit.wikimedia.org/r/348044 [08:17:28] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete parameter for debconf::seen [puppet] - 10https://gerrit.wikimedia.org/r/348044 (owner: 10Muehlenhoff) [08:22:23] (03PS17) 10Gehel: maps - move to role / profile [puppet] - 10https://gerrit.wikimedia.org/r/347006 [08:32:35] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3177629 (10Dereckson) @Rameshti @nirajan_pant @Janak_bhatta Perhaps could you have a meeting on IRC, Hangout or at an Incubator talk page to discuss the translations, then report t... [08:35:16] (03CR) 10Dereckson: [C: 031] "wb. isn't a country code, but the ISO 3166 code is well IN-WB, so "wb" is indeed the second level administrative region code." [dns] - 10https://gerrit.wikimedia.org/r/347141 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [08:35:38] (03PS6) 10Muehlenhoff: Mark wireshark-common/install-setuid as seen to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [08:42:50] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3177634 (10hashar) There is a global lock to execute commands, but would it also preve... [08:45:26] (03CR) 10Dereckson: [C: 031] Add wb.wikimedia.org to ServerAlias for wikimedia-chapter Vhost [puppet] - 10https://gerrit.wikimedia.org/r/347142 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [09:04:07] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3177676 (10Janak_bhatta) Apologize for miss understanding. We have already discussed about the translations on dty incubator project. 
We came on the result that we are going for t... [09:12:48] !log rebooting restbase1010 to Linux 4.9 [09:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:22] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3177696 (10Dereckson) 05stalled>03Open Thanks for the update. Let's see if all is ready to create the wiki this evening. [09:26:26] (03CR) 10Volans: [C: 032] Change the RO message we match [switchdc] - 10https://gerrit.wikimedia.org/r/347868 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [09:30:53] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.38 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345608 (owner: 10Gilles) [09:31:32] (03CR) 10Volans: [C: 032] Warmup: remove api warmup, useless, and add a log line. [switchdc] - 10https://gerrit.wikimedia.org/r/347870 (owner: 10Giuseppe Lavagetto) [09:31:38] (03PS2) 10Volans: Warmup: remove api warmup, useless, and add a log line. [switchdc] - 10https://gerrit.wikimedia.org/r/347870 (owner: 10Giuseppe Lavagetto) [09:31:45] (03CR) 10jerkins-bot: [V: 04-1] Warmup: remove api warmup, useless, and add a log line. [switchdc] - 10https://gerrit.wikimedia.org/r/347870 (owner: 10Giuseppe Lavagetto) [09:32:29] (03PS3) 10Volans: Warmup: remove api warmup, useless, and add a log line. [switchdc] - 10https://gerrit.wikimedia.org/r/347870 (owner: 10Giuseppe Lavagetto) [09:36:49] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3177739 (10Dereckson) **Namespaces translation issue.** Not directly related to the wiki creation but a small preoccupation: you perhaps want to translate main namespaces (User: F... 
[09:39:03] !log rebooting restbase1011 to Linux 4.9 [09:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:17] (03PS2) 10Dereckson: Initial configuration for dtywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529) (owner: 10DatGuy) [09:42:04] (03CR) 10Dereckson: "PS2: HD logos, translation update, references to task ID" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529) (owner: 10DatGuy) [09:42:36] (03CR) 10Dereckson: [C: 04-1] "We want namespaces localisation before to create this wiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347217 (https://phabricator.wikimedia.org/T161529) (owner: 10DatGuy) [09:46:22] (03PS4) 10Volans: Move varnish puppet disabling in t00 [switchdc] - 10https://gerrit.wikimedia.org/r/347869 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [09:47:44] (03CR) 10Volans: [C: 032] Move varnish puppet disabling in t00 [switchdc] - 10https://gerrit.wikimedia.org/r/347869 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [09:54:07] (03PS7) 10Muehlenhoff: Mark wireshark-common/install-setuid as seen to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [09:55:30] (03CR) 10Ema: [C: 031] "One nit, LGTM otherwise. Confirmed to be working as advertised on my self-hosted puppetmaster." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346162 (owner: 10Muehlenhoff) [09:56:08] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:56:17] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:17] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:56:17] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:56:17] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:17] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:17] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:18] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:18] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:27] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:56:27] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:37] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:56:37] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:56:40] (03CR) 10Filippo Giunchedi: "LGTM, PCC fails on tin/mira currently but the change is ok on e.g. 
mw1200" [puppet] - 10https://gerrit.wikimedia.org/r/347898 (https://phabricator.wikimedia.org/T162814) (owner: 10Thcipriani) [09:56:47] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:57:06] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:57:06] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:57:07] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:57:07] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:57:10] ema: cp3038 is you? [09:57:16] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:57:31] godog: yep [09:57:46] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100% [09:59:46] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [10:00:26] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [10:02:43] godog: can you reach cp3038's management interface? 
[10:03:40] ema: I tried but it doesn't work :( [10:03:49] meh [10:05:11] I can't access it either, tried to connect when alarms went off [10:06:20] (03PS1) 10Volans: Menu: don't allow to quit from submenu [switchdc] - 10https://gerrit.wikimedia.org/r/348055 (https://phabricator.wikimedia.org/T160178) [10:06:41] (03PS1) 10Ayounsi: Depooling ESAMS for T162601 [dns] - 10https://gerrit.wikimedia.org/r/348056 [10:08:22] (03PS2) 10Elukey: check_hadoop_yarn_node_state: add syslog logging for CRITICAL states [puppet] - 10https://gerrit.wikimedia.org/r/347857 [10:08:40] (03CR) 10Ema: [C: 031] Depooling ESAMS for T162601 [dns] - 10https://gerrit.wikimedia.org/r/348056 (owner: 10Ayounsi) [10:08:45] !log rebooting restbase1016 to Linux 4.9 [10:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:10] (03CR) 10Ayounsi: [C: 032] Depooling ESAMS for T162601 [dns] - 10https://gerrit.wikimedia.org/r/348056 (owner: 10Ayounsi) [10:11:09] (03CR) 10Volans: [C: 032] Menu: don't allow to quit from submenu [switchdc] - 10https://gerrit.wikimedia.org/r/348055 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [10:13:38] !log upgrade thumbor to 0.1.38 [10:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:57] !log executed CONFIG SET appendfsync no (prev value: "everysec") to Redis instance 6380 on rdb2005 - T125735 [10:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:05] T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735 [10:31:10] 06Operations, 10ops-esams, 06DC-Ops: Broken IPMI/drac on cp3038 - https://phabricator.wikimedia.org/T157537#3177823 (10ema) Got bitten by this again today while rebooting cp3038 into Linux 4.9. I'd say it's time to fix this machine. Note that this time there is absolutely no way to bring the host back onlin... 
[10:32:05] ACKNOWLEDGEMENT - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100% Ema T157537 [10:33:23] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:36:33] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2998815 keys, up 20 days 18 hours - replication_delay is 641 [10:37:18] guess that my change triggered an issue in another db.. --^ [10:37:20] checking [10:39:08] so it seems that it lost the connection with the master and it is trying to resync [10:42:06] on the precautionary side I reverted my change on the 6380 instance [10:42:14] !log disable V6 transit BGP session on cr2-knams for T162601 [10:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:21] T162601: knams equipment move - https://phabricator.wikimedia.org/T162601 [10:42:36] (03PS1) 10Filippo Giunchedi: Switch deployment server to codfw [dns] - 10https://gerrit.wikimedia.org/r/348060 [10:47:09] !log Confirmed we can still reach cr2-knams:lo0 via v6 (from esams), disabling IPv4 transit sessions for T162601 [10:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:23] !log reverted previous config for Redis rdb2005 [10:47:30] (03PS1) 10Gilles: Make Thumbor connect to Swift via https [puppet] - 10https://gerrit.wikimedia.org/r/348061 [10:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:03] PROBLEM - Check Varnish expiry mailbox lag on cp1048 is CRITICAL: CRITICAL: expiry mailbox lag is 664409 [10:54:53] who is in charge of prometheus? 
[10:55:24] godog is surely the first point of reference :) [10:55:40] one of its cron job is flooding my inbox with error messages [10:56:07] XioNoX: that's ema's fault :D [10:56:10] well, not my inbox inbox, but my cron folder, still should be fixed [10:56:14] I've reported it yesterday in traffic [10:56:23] at reboot of the cache hosts [10:56:26] it's triggered [10:57:29] ah okay [10:57:47] 06Operations, 13Patch-For-Review, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3177893 (10Volans) [10:57:49] 06Operations, 10Traffic, 13Patch-For-Review, 15User-Elukey: prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet - https://phabricator.wikimedia.org/T157353#3177891 (10Volans) 05Resolved>03Open Re-opening because this is happening when rebooting hosts, see last days root@ mails [10:57:49] is there a task to track it? [10:57:59] ah [10:58:12] !log rebooting alfafi to Linux 4.9 [10:58:16] there is a tracking for all root@ "spam" and I've just re-opened the related one [10:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:20] !log rebooting alsafi to Linux 4.9 [10:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:25] that already existed apparently :D [10:58:27] that's perfect, I was exactly looking to do that (only having actionable emails in my inbox :) [10:59:03] it's actually part of the clinic duty list of things to check [11:00:32] clinig duty? [11:00:37] er clinic duty? 
[11:01:09] https://wikitech.wikimedia.org/wiki/Ops_Clinic_Duty [11:01:23] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [11:03:58] 06Operations, 06Performance-Team, 10Thumbor: Write graceful rolling restart script for Thumbor - https://phabricator.wikimedia.org/T162875#3177901 (10Gilles) [11:04:07] 06Operations, 06Performance-Team, 10Thumbor: Write graceful rolling restart script for Thumbor - https://phabricator.wikimedia.org/T162875#3177913 (10Gilles) p:05Triage>03Lowest [11:07:40] ah ok now I got it what is port 6479 on rdb2005, we don't have rdb2007 so rdb2005 has more Redis instances [11:08:32] so I am still watching logs but it seems that the replication fails for the output-buffer-issue that we thought we had solved (cc: akosiaris ) [11:10:03] cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits [11:10:17] Connection with slave 10.192.32.133:6479 lost. [11:10:18] etc.. [11:10:23] PROBLEM - Host cr2-knams is DOWN: CRITICAL - Network Unreachable (91.198.174.246) [11:12:53] PROBLEM - Host cr2-knams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::4) [11:12:57] MASTER <-> SLAVE sync: receiving 1630566050 bytes from master [11:13:05] that afaics are 1.6GB [11:13:20] (rdb2005 receiving from rdb1007) [11:13:54] ACKNOWLEDGEMENT - Host cr2-knams IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ffff::4) Ema https://phabricator.wikimedia.org/T162601 - The acknowledgement expires at: 2017-04-14 15:00:00. [11:13:54] ACKNOWLEDGEMENT - Host cr2-knams is DOWN: CRITICAL - Network Unreachable (91.198.174.246) Ema https://phabricator.wikimedia.org/T162601 - The acknowledgement expires at: 2017-04-14 15:00:00. 
[11:14:21] elukey: I guess is a total counter [11:15:05] look at the graphs for eth of rdb2005 ;) [11:15:37] volans: yep yep, but we set soft and hard limits on the master for the output-buffer-limit [11:15:49] and I think we are hitting the soft one [11:15:56] ok [11:16:15] we had the same issue a while ago and Alex fixed it increasing the buffers [11:16:26] to 2048m hard limit and 512m soft [11:24:39] so the soft limit should be if the client output buffer size is 512m for 60 seconds [11:25:44] in this case, if the threshold is met, the redis master drops the connection [11:25:52] the slave is sad, and retries to sync [11:26:00] etc.. [11:34:23] RECOVERY - Host cr2-knams is UP: PING OK - Packet loss = 0%, RTA = 86.75 ms [11:36:43] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 44, down: 5, dormant: 0, excluded: 0, unused: 0BRxe-0/0/2: down - Transit: Tele2 (AMS13-CORE-1:4/2, donated) {#13443} [10Gbps]BRxe-1/2/0: down - Transit: Init7 (donated) {#14009} [10Gbps]BRxe-1/0/0: down - Core: asw-esams:xe-3/0/42 (GBLX leg 2) {#14007} [10Gbps DF CWDM C49]BRxe-1/3/0: down - Transit: LibertyGlobal (BB00088, donat [11:37:18] !log temporary set config set client-output-buffer-limit "slave 2147483648 2147483648 60" on rdb1007:6379 to give time to rdb2005's replication to catch up - T159850 [11:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:26] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [11:38:43] RECOVERY - Host cr2-knams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 84.78 ms [11:41:55] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3152568 (10faidon) >>! In T162099#3169146, @BBlack wrote: > @ayounsi Let's let it burn in with no traffic until tomorrow sometime, then sync up on reverting the router config hacks and watching the traf... 
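Editorial note: the soft/hard limit behaviour described just above can be sketched in a few lines. This is a minimal model, assuming the rule documented for Redis `client-output-buffer-limit <class> <hard> <soft> <soft-seconds>`: the client is dropped as soon as its output buffer reaches the hard limit, or once it has stayed at or above the soft limit for longer than the soft-seconds window. The names `OutputBufferLimit` and `first_disconnect`, and the 10 MB/s growth rate, are hypothetical and only for illustration; this is not Redis code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

MB = 1024 * 1024


@dataclass
class OutputBufferLimit:
    """Illustrative stand-in for one client-output-buffer-limit entry."""
    hard_bytes: int
    soft_bytes: int
    soft_seconds: int


def first_disconnect(
    limit: OutputBufferLimit,
    samples: Iterable[Tuple[int, int]],
) -> Optional[Tuple[int, str]]:
    """Return (time, reason) of the first disconnect, or None.

    `samples` is a sequence of (unix_time_seconds, buffer_bytes) readings,
    one per second in the example below.
    """
    soft_since: Optional[int] = None  # when the soft limit was first crossed
    for t, used in samples:
        # Hard limit: immediate disconnect.
        if limit.hard_bytes and used >= limit.hard_bytes:
            return (t, "hard")
        # Soft limit: disconnect only if exceeded continuously for too long.
        if limit.soft_bytes and used >= limit.soft_bytes:
            if soft_since is None:
                soft_since = t
            elif t - soft_since > limit.soft_seconds:
                return (t, "soft")
        else:
            soft_since = None  # dropped back under the soft limit: reset
    return None


# Hypothetical buffer growing by 10 MB every second.
growth = [(t, 10 * MB * t) for t in range(300)]

# Limits mentioned in the chat: 2048m hard, 512m soft, 60 seconds.
old = OutputBufferLimit(hard_bytes=2048 * MB, soft_bytes=512 * MB, soft_seconds=60)
# Temporary setting from the !log entry: hard == soft == 2147483648, 60 seconds.
temp = OutputBufferLimit(hard_bytes=2147483648, soft_bytes=2147483648, soft_seconds=60)

print(first_disconnect(old, growth))
print(first_disconnect(temp, growth))
```

With the soft limit set equal to the hard limit, the soft window never gets a chance to trigger, so only the hard limit matters in that configuration.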
[11:42:30] what is cr2-knams for? It keeps going down. [11:42:54] paladox: https://phabricator.wikimedia.org/T162601 [11:43:03] thanks [11:43:40] 06Operations, 07HHVM: Frequent TCP RST on connections between HHVM and Redis - https://phabricator.wikimedia.org/T162354#3177957 (10MoritzMuehlenhoff) I've merged the patch into our 3.18 package, a build is currently running on copper (but will take a few hours). [11:47:53] (03PS1) 10Ema: Bind instrumentation port to localhost only [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 [11:49:55] rdb1007:6379 still complaining about output-buffer-limit reached [11:50:24] I'll keep watching this sync (maybe new params are applied now) [11:51:10] if this is not enough, I'll raise the limits a bit more [11:52:00] 06Operations, 10Icinga: Update icinga to 2.x - https://phabricator.wikimedia.org/T162542#3177977 (10Paladox) I've actually been working on this. It's not as hard as you make it. I am doing it for labs. I've setup https://gerrit-icinga.wmflabs.org/icingaweb2/ I found the config files will need changing. nrpe... [11:54:43] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 [11:54:46] (03CR) 10Paladox: "I think icinga2 has a systemd file." 
[puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [11:59:31] !log temporary set config set client-output-buffer-limit "slave 2536870912 2536870912 60" on rdb1007:6379 [11:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:49] this one was executed by Alex the last time [11:59:57] let's see if it helps [12:01:13] (03PS11) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [12:04:23] (03PS2) 10Ema: Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 [12:09:51] not working, output buffers limit breached again [12:10:04] mmmm [12:10:26] (03CR) 10Giuseppe Lavagetto: [C: 031] Release pybal 1.13.6 (031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (owner: 10Ema) [12:10:47] so the error is [12:10:48] Client id=2466259767 addr=10.192.32.133:42552 fd=45 name= age=413 idle=413 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=13052 oll=68537 omem=2536946792 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits. 
[12:11:17] I guess that omem=2536946792 is the memory used by the buffer [12:11:35] but afaics on the slave only 1.6G are requested [12:11:50] <_joe_> elukey: the solution to your problem is understanding what the fuck are we doing with those 1.6 gb [12:12:04] _joe_ that's the full db resync [12:12:10] hello :) [12:12:35] <_joe_> elukey: i mean why do we have ssuch a huge dataset on one redis instance [12:12:44] * _joe_ fades away again [12:13:13] okok the answer to the question is right below "Remove lua scripts from the jobqueues" [12:13:20] it might take a while :P [12:16:42] !log restarting ntp on achernar [12:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:20] (03CR) 10Volans: [C: 032] Use a generic retry for the read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:18:29] (03Merged) 10jenkins-bot: Use a generic retry for the read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:18:39] (03CR) 10jenkins-bot: Use a generic retry for the read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:22:25] !log volans@tin Synchronized wmf-config/db-codfw.php: Use a generic retry for the read only message T160178 (duration: 01m 54s) [12:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:32] T160178: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178 [12:26:33] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2988715 keys, up 20 days 20 hours - replication_delay is 0 [12:27:35] ah! [12:27:43] hello rdb2005! 
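Editorial note: the `omem` question raised above can be checked directly against the numbers in the log. The sketch below parses the `key=value` fields of the client line quoted at 12:10:48 and compares `omem` with the limit set at 11:59 (2536870912). The helper name `parse_client_info` is hypothetical. The average growth figure assumes the buffer filled steadily over the whole connection age and that nothing was drained, which the log does not confirm.

```python
def parse_client_info(line: str) -> dict:
    """Split a Redis client description into a {key: value} dict.

    Tokens without '=' are ignored; empty values (e.g. 'name=') are kept
    as empty strings.
    """
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Fields copied from the log line at 12:10:48.
LINE = (
    "id=2466259767 addr=10.192.32.133:42552 fd=45 name= age=413 idle=413 "
    "flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=13052 "
    "oll=68537 omem=2536946792 events=rw cmd=psync"
)

info = parse_client_info(LINE)
omem = int(info["omem"])        # bytes held in the client's output buffer
age = int(info["age"])          # connection age in seconds
limit = 2536870912              # limit set at 11:59 via CONFIG SET

excess = omem - limit           # how far past the limit the buffer was
avg_rate = omem / age           # assumed steady growth, bytes per second

print(f"omem exceeds the limit by {excess} bytes")
print(f"average growth: {avg_rate / (1024 * 1024):.2f} MiB/s over {age} s")
```

The buffer is only slightly over the configured limit, which is consistent with the connection being closed right as the limit was crossed rather than with a fixed 2.5 GB requirement.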
[12:30:47] ah this might be only the alarm, I can still see Redis trying to sync [12:31:09] (03PS3) 10Ema: Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) [12:31:45] (03CR) 10Ema: Release pybal 1.13.6 (031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [12:32:21] (03PS4) 10Ema: Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) [12:34:16] !log temporary set config set client-output-buffer-limit "slave 3221225472 3221225472 180" on rdb1007:6379 [12:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:31] !log volans@tin Synchronized wmf-config/db-eqiad.php: Use a generic retry for the read only message T160178 (duration: 00m 44s) [12:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:38] T160178: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178 [12:38:33] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 612 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2988826 keys, up 20 days 20 hours - replication_delay is 612 [12:42:17] so even with 3221225472 we reach the hard limit, if I am reading correctly from the logs [12:47:18] ok so what is happening is that the master calls BGSAVE that creates a new rdb file under /srv/redis/rdb1007-6379.rdb [12:47:27] 06Operations, 10Traffic, 10netops, 13Patch-For-Review: knams equipment move - https://phabricator.wikimedia.org/T162601#3178079 (10ayounsi) Work finished at 14:00 local time, interfaces confirmed up, LACP active. BGP re-enabled at ~14:30. Everything established, interfaces passing traffic. Will revert the... 
[12:47:34] then it sends it to the slave [12:47:44] but that file is only 1.6G [12:48:09] so there should be plenty of output buffer space with the current limits [12:48:33] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2988752 keys, up 20 days 20 hours - replication_delay is 0 [12:51:14] [17274] 13 Apr 12:47:58.730 * Synchronization with slave succeeded [12:51:17] * elukey dances [12:51:25] lol [12:51:50] does this means that it requires size*2? :D [12:51:55] the bad news is that I had to use 5G as hard limit [12:51:58] /o\ [12:52:17] !log temporary set config set client-output-buffer-limit "slave 5368709120 5368709120 180" on rdb1007:6379 [12:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:32] o/ [12:57:49] hey hashar, welcome back :) [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T1300). Please do the needful. [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:13] o/ [13:00:15] I'm here :) [13:00:28] !log Upgrading thumbor* to Linux 4.9 [13:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:03] RECOVERY - Check Varnish expiry mailbox lag on cp1048 is OK: OK: expiry mailbox lag is 0 [13:02:07] I can swat today. unless somebody else wants to do it? [13:03:02] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3178102 (10elukey) 05Resolved>03Open [13:03:05] Urbanecm: I dont get the project logo system [13:03:20] hashar: I don't understand you. Which project logo system? 
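Editorial note: for reference, the replica output buffer limits logged for rdb1007:6379 during this session are easier to compare in GiB, next to the 1630566050-byte sync size reported by the slave. The values and log times below are copied from the `!log` entries; the helper `to_gib` is hypothetical. The final ratio only describes what was configured; the log does not establish the smallest limit that would have worked.

```python
GIB = 1024 ** 3


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


# (time the !log entry appeared, hard == soft limit in bytes, soft seconds)
ATTEMPTS = [
    ("11:37", 2147483648, 60),
    ("11:59", 2536870912, 60),
    ("12:34", 3221225472, 180),
    ("12:52", 5368709120, 180),
]

# "MASTER <-> SLAVE sync: receiving 1630566050 bytes from master"
RDB_BYTES = 1630566050

for logged_at, limit, seconds in ATTEMPTS:
    print(f"{logged_at}  {limit:>11d} bytes = {to_gib(limit):.2f} GiB, {seconds} s")

print(f"sync size: {to_gib(RDB_BYTES):.2f} GiB")
print(f"final limit / sync size: {ATTEMPTS[-1][1] / RDB_BYTES:.2f}x")
```

The final setting is a bit over three times the transferred size, so the "size*2" guess in the chat undershoots what was actually configured.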
[13:03:35] Urbanecm: we used to have mw configured to point to [[Image:Wiki.png]] so that each projects can easily change their logo [13:03:48] Hi [13:03:50] so eg arwikinews has //upload.wikimedia.org/wikinews/ar/b/bc/Wiki.png [13:03:56] hashar: ori and krinkle changed it [13:04:04] we now host logos in static/ folder [13:04:33] As a drawback, projects lost freedom to personalize logos for anniversairies, commemorations, etc. [13:04:46] Now I understand. At almost all families logos are in static. So I've converted Wikinews and Wikiversities to the same system (and asked for user-notice). [13:04:51] As an advantage, we've a centralized place to maintain them, resources are cached, optimzied, etc. [13:05:29] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3178115 (10ayounsi) a:05ayounsi>03BBlack Moving that one back to Brandon [13:05:30] ok make sense [13:05:30] hashar, Dereckson: should I do the swat, or do you want to? [13:05:41] then I am not sure why it happens only now :D [13:05:59] (03CR) 10Hashar: [C: 031] Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:06:08] (03CR) 10Hashar: [C: 031] Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:06:30] hashar: why what happens only now? 
[13:06:58] Urbanecm: the centralised place [13:07:00] Urbanecm: he probably talks about changing paths to project logos cc hashar [13:07:02] Urbanecm: in all logic, someone should have noticed some logos were still missing long before you did [13:07:25] (03CR) 10Hashar: [C: 031] Close wikimania2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344321 (https://phabricator.wikimedia.org/T161183) (owner: 10Urbanecm) [13:08:02] Urbanecm: yeah I am just ranting about how the switch from local [[File:Wiki.png]] to static/project-logos should have been done earlier and in one pass [13:08:04] ignore me :D [13:08:18] hashar: should I merge the commits? :D [13:08:19] so who is gonna push the patches ? :D [13:08:30] hashar: we thought you are going ;) [13:09:01] once upon a time SWAT was organised for once xD [13:09:39] I can push the changes, unless somebody else wants to do it [13:09:52] rebasing them [13:10:01] (03PS2) 10Hashar: Close wikimania2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344321 (https://phabricator.wikimedia.org/T161183) (owner: 10Urbanecm) [13:10:03] (03PS3) 10Hashar: Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:10:05] (03PS6) 10Hashar: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:10:33] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479 [13:10:46] (03PS1) 10Ayounsi: Revert "Depooling ESAMS for T162601" [dns] - 10https://gerrit.wikimedia.org/r/348069 [13:10:51] zeljkof: wanna sync them or should I? [13:10:57] hashar: go ahead [13:11:01] \O/ [13:11:27] can someone review/+1/r+ https://gerrit.wikimedia.org/r/#/c/348069/ ? 
[13:11:29] (03CR) 10Hashar: [C: 032] Close wikimania2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344321 (https://phabricator.wikimedia.org/T161183) (owner: 10Urbanecm) [13:11:31] (03CR) 10Hashar: [C: 032] Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:11:33] (03CR) 10Hashar: [C: 032] Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:11:33] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2989282 keys, up 20 days 20 hours - replication_delay is 0 [13:11:34] ema: ^ [13:11:47] er, I mean my previous message above the bots [13:12:20] XioNoX: looking [13:12:36] (03Merged) 10jenkins-bot: Close wikimania2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344321 (https://phabricator.wikimedia.org/T161183) (owner: 10Urbanecm) [13:12:38] (03Merged) 10jenkins-bot: Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:12:41] (03Merged) 10jenkins-bot: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [13:12:50] (03CR) 10jenkins-bot: Close wikimania2016 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344321 (https://phabricator.wikimedia.org/T161183) (owner: 10Urbanecm) [13:13:49] syncing [13:14:11] XioNoX: the revert looks good, I assume knams maintenance is finished and all is in order? 
[13:14:12] Urbanecm: pulled on mwdebug1001 if you wanna test [13:14:23] ema: indeed [13:14:27] !log hashar@tin Synchronized static/images/project-logos: (no justification provided) (duration: 00m 46s) [13:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:41] (03CR) 10Ema: [C: 031] Revert "Depooling ESAMS for T162601" [dns] - 10https://gerrit.wikimedia.org/r/348069 (owner: 10Ayounsi) [13:14:44] XioNoX: OK, +1, same dance as before to apply the change [13:14:54] cool, thanks! [13:15:00] (03CR) 10Ayounsi: [C: 032] Revert "Depooling ESAMS for T162601" [dns] - 10https://gerrit.wikimedia.org/r/348069 (owner: 10Ayounsi) [13:16:54] !log hashar@tin Synchronized dblists/closed.dblist: Close wikimania2016 - T161183 (duration: 00m 43s) [13:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:01] T161183: Close wikimania2016 wiki - https://phabricator.wikimedia.org/T161183 [13:17:03] hashar: please deploy [13:17:38] Urbanecm: for wikimania2016 I guess we should drop the central notice banner somehow [13:18:25] hashar: Yeah, I'll take care about it after the deploy. [13:18:40] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 43s) [13:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:54] Urbanecm: then maybe we will have to reopen the wiki to disable the banner ? [13:19:35] hashar: no, it won't be needed. It is just needed to delete https://wikimania2016.wikimedia.org/wiki/MediaWiki:Sitenotice and stewards are able to do it even the wiki is closed. [13:19:45] (delete or edit) [13:19:52] \O/ [13:21:57] 06Operations, 10Traffic, 10netops, 13Patch-For-Review: knams equipment move - https://phabricator.wikimedia.org/T162601#3178159 (10ayounsi) 05Open>03Resolved esams reenabled in DNS, confirmed traffic is properly passing through knams. 
[13:24:56] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3178170 (10elukey) A bit of recap to explain what happened today. While I was testing T125735#3177819 on rdb2005:6380 I noticed some Icinga alarms related to rdb2005... [13:25:21] this is the summary of my attempts --^ [13:26:37] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#1996056 (10elukey) Had to revert because of https://phabricator.wikimedia.org/T159850#... [13:27:10] (03PS5) 10Ema: Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) [13:28:23] !log powercycling thumbor1001, stuck in reboot [13:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:23] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last) [13:32:31] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3178185 (10ema) Traffic switched to lvs2002 properly: https://grafana.wikimedia.org/dashboard/db/load-balancers?panelId=8&fullscreen&orgId=1&from=1492081888998&to=1492085531622 [13:32:57] hashar: The notice is now gone :) [13:34:04] Urbanecm: you are the best of us! 
[13:35:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:36:23] (03PS1) 10Muehlenhoff: Allow paged searches to exceed the size limit for searches [puppet] - 10https://gerrit.wikimedia.org/r/348071 (https://phabricator.wikimedia.org/T162745) [13:37:55] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3178195 (10elukey) Now I am not able to reach the console too. I tried the following without success: ``` elukey@neodymium:~$ sudo ipmitool -I lanplus -H analyt... [13:38:00] (03PS1) 10BBlack: role::ntp: update manual upstream timeservers [puppet] - 10https://gerrit.wikimedia.org/r/348072 [13:44:19] Hi, im wondering how does wikimedia manage to get arguments in the nrpe plugin to work? Debian supposidly disabled it by default. [13:44:33] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3178221 (10H-stt) No, moving this discussion to another forum on Meta is not an acceptable option. The board's resolution demands action by the ops, so this issue needs to be discussed with the ops, and this... [13:48:28] (03CR) 10Muehlenhoff: [C: 04-1] role::ntp: update manual upstream timeservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348072 (owner: 10BBlack) [13:53:10] (03CR) 10Muehlenhoff: [C: 04-1] role::ntp: update manual upstream timeservers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/348072 (owner: 10BBlack) [13:56:53] (03CR) 10Muehlenhoff: [C: 04-1] "The others entries are confirmed to use an open access policy without notification required." 
[puppet] - 10https://gerrit.wikimedia.org/r/348072 (owner: 10BBlack) [13:57:03] PROBLEM - Check Varnish expiry mailbox lag on cp3037 is CRITICAL: CRITICAL: expiry mailbox lag is 866134 [14:00:58] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#2962020 (10hashar) @H-stt a bit more context. Sustainability was one of the important selection criteria for the previous datacenter. See the request for comment from 2013 at https://wikimediafoundation.org/w... [14:12:54] (03CR) 10Ottomata: [C: 031] "Hm, seems fine, but it is too bad git::clone doesn't allow you to clone the same remote repo to different directory locations. I guess th" [puppet] - 10https://gerrit.wikimedia.org/r/347855 (owner: 10Gehel) [14:16:51] (03PS1) 10Ema: pybal: bind instrumentation TCP port to private addresses [puppet] - 10https://gerrit.wikimedia.org/r/348074 (https://phabricator.wikimedia.org/T103882) [14:17:20] 06Operations, 10Traffic: Select location for Asia Cache DC - https://phabricator.wikimedia.org/T156029#3178320 (10BBlack) This is the last time I'll respond to trolling on this ticket. >>! In T156029#3166703, @H-stt wrote: >>>! In T156029#3053235, @BBlack wrote: >> >>>>! In T156029#3053179, @Gnom1 wrote: >>... 
[14:20:54] (03CR) 10BBlack: role::ntp: update manual upstream timeservers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/348072 (owner: 10BBlack) [14:21:27] (03PS2) 10BBlack: role::ntp: update manual upstream timeservers [puppet] - 10https://gerrit.wikimedia.org/r/348072 [14:22:08] (03CR) 10Ema: "pcc output here: https://puppet-compiler.wmflabs.org/6151/" [puppet] - 10https://gerrit.wikimedia.org/r/348074 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [14:25:19] (03CR) 10Muehlenhoff: [C: 031] role::ntp: update manual upstream timeservers [puppet] - 10https://gerrit.wikimedia.org/r/348072 (owner: 10BBlack) [14:27:34] !log disabling puppet on recnds/ntp boxes to control patch rollout [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:55] (03CR) 10BBlack: [C: 032] role::ntp: update manual upstream timeservers [puppet] - 10https://gerrit.wikimedia.org/r/348072 (owner: 10BBlack) [14:29:31] (03PS2) 10Filippo Giunchedi: Make Thumbor connect to Swift via https [puppet] - 10https://gerrit.wikimedia.org/r/348061 (owner: 10Gilles) [14:31:20] (03CR) 10Filippo Giunchedi: [C: 032] Make Thumbor connect to Swift via https [puppet] - 10https://gerrit.wikimedia.org/r/348061 (owner: 10Gilles) [14:37:03] RECOVERY - Check Varnish expiry mailbox lag on cp3037 is OK: OK: expiry mailbox lag is 0 [14:39:20] jouncebot: next [14:39:20] In 1 hour(s) and 20 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T1600) [14:39:23] jouncebot: now [14:39:23] No deployments scheduled for the next 1 hour(s) and 20 minute(s) [14:42:53] PROBLEM - NTP peers on chromium is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:43:03] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:43:53] RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -5.7e-05 secs [14:45:03] RECOVERY - NTP peers on nescio is OK: NTP OK: Offset 
0.000235 secs [14:50:45] !log installing bouncycastle security updates [14:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:53] RECOVERY - NTP peers on achernar is OK: NTP OK: Offset -0.007137 secs [14:51:13] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:53:03] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:53:13] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset -3e-05 secs [14:54:00] (03PS2) 10Andrew Bogott: labtest: avoid broken Icinga checks on labtest [puppet] - 10https://gerrit.wikimedia.org/r/348022 (https://phabricator.wikimedia.org/T152024) (owner: 10Dzahn) [14:54:02] (03PS3) 10Andrew Bogott: Repool labvirt1001 [puppet] - 10https://gerrit.wikimedia.org/r/347887 (https://phabricator.wikimedia.org/T159835) [14:55:03] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.015021 secs [14:55:19] !log ppchelko@tin Started deploy [changeprop/deploy@e47afea]: Provide separate rules for ORES precaching in both DCs [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:17] !log ppchelko@tin Finished deploy [changeprop/deploy@e47afea]: Provide separate rules for ORES precaching in both DCs (duration: 00m 58s) [14:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:52] (03CR) 10Andrew Bogott: [C: 032] labtest: avoid broken Icinga checks on labtest [puppet] - 10https://gerrit.wikimedia.org/r/348022 (https://phabricator.wikimedia.org/T152024) (owner: 10Dzahn) [14:58:45] (03CR) 10Andrew Bogott: [C: 032] Repool labvirt1001 [puppet] - 10https://gerrit.wikimedia.org/r/347887 (https://phabricator.wikimedia.org/T159835) (owner: 10Andrew Bogott) [15:01:42] !log disabling puppet on seaborgium and serpens for a cautious merge of https://gerrit.wikimedia.org/r/#/c/348071 [15:01:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:33] !log disabling puppet on dubnium and pollux for a cautious merge of https://gerrit.wikimedia.org/r/#/c/348071 [15:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] (03PS2) 10Andrew Bogott: Allow paged searches to exceed the size limit for searches [puppet] - 10https://gerrit.wikimedia.org/r/348071 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [15:05:39] (03PS1) 10BBlack: role::ntp: replace one more faulty NTP upstream [puppet] - 10https://gerrit.wikimedia.org/r/348085 [15:06:13] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:26] ema: ^ ? [15:06:29] sigh, cp3045 is also not rebooting fine [15:06:39] ok :) [15:06:54] and no mgmt either [15:07:15] I've used mgmt on all of those cp30[34]x before I think [15:07:25] so it at least was working, at some relatively-recent point [15:07:46] (because I remember auditing BIOS settings on them all after the last time they were having these kinds of random issues) [15:07:52] (03CR) 10jenkins-bot: Convert $stdlogo to static/images/project-logos resources at Wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346054 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [15:08:07] (03CR) 10BBlack: [V: 032 C: 032] role::ntp: replace one more faulty NTP upstream [puppet] - 10https://gerrit.wikimedia.org/r/348085 (owner: 10BBlack) [15:08:25] (03CR) 10jenkins-bot: Switch from $stdlogo to static resources at Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346047 (https://phabricator.wikimedia.org/T161980) (owner: 10Urbanecm) [15:08:53] (03CR) 10Andrew Bogott: [C: 032] Allow paged searches to exceed the size limit for searches [puppet] - 10https://gerrit.wikimedia.org/r/348071 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [15:09:00] (03PS3) 10Andrew Bogott: Allow paged searches to exceed the size limit for searches [puppet] - 
10https://gerrit.wikimedia.org/r/348071 (https://phabricator.wikimedia.org/T162745) (owner: 10Muehlenhoff) [15:09:57] and if they are not listed in T150160, should have passed my audit of IPMI (ema, bblack) [15:09:57] T150160: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160 [15:10:07] at that time of course [15:10:25] 06Operations, 10ops-codfw: Swap NIC on mira - https://phabricator.wikimedia.org/T162859#3177514 (10RobH) The 1Gbit nic on ALL of our servers is a build in NIC. On the newest systems, they may be replaceable (not sure) but on these older ones, its part of the mainboard and when it goes, that is it. Sometimes... [15:10:36] ipmitool seems to confirm that something is wrong: [15:10:41] Error: Unable to establish IPMI v2 / RMCP+ session [15:11:13] ema: did you try ipmi before the reboot by any chance? [15:11:18] volans: nope [15:11:39] probably will have to pull power cords to fix it [15:12:05] 06Operations, 10ops-esams, 06DC-Ops: Broken IPMI/drac on cp3038 and cp3045 - https://phabricator.wikimedia.org/T157537#3178560 (10ema) [15:12:13] (side note: maybe we should start using PDUs that have ssh and let you shut ports off and on and label them, etc) [15:12:13] PROBLEM - NTP peers on maerlant is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [15:13:13] RECOVERY - NTP peers on maerlant is OK: NTP OK: Offset 0.00138 secs [15:13:29] 06Operations, 10ops-esams, 06DC-Ops: Broken IPMI/drac on cp3038 and cp3045 - https://phabricator.wikimedia.org/T157537#3008548 (10ema) Same issue on cp3045: the mgmt IP is reachable but I can't ssh into it. Further, `chassis status` fails with: ``` Error: Unable to establish IPMI v2 / RMCP+ session ``` Bot... [15:14:10] ACKNOWLEDGEMENT - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100% Ema T157537 [15:16:13] (03CR) 10Gehel: "Nope, not what is needed here (at least to my understanding, scap::target does the same, cloning a single repo). 
It looks to me that git:" [puppet] - 10https://gerrit.wikimedia.org/r/347855 (owner: 10Gehel) [15:16:30] 06Operations, 10ops-eqiad, 10Analytics, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3178581 (10Nuria) [15:17:55] 06Operations, 10ops-codfw: Swap NIC on mira - https://phabricator.wikimedia.org/T162859#3178588 (10Papaul) p:05Triage>03Normal [15:28:19] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3178711 (10Papaul) Disk wipe in progress [15:30:31] 06Operations, 10hardware-requests: spare pool allocation of WMF6406 to replace mira - https://phabricator.wikimedia.org/T162897#3178728 (10RobH) [15:31:07] 06Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3178745 (10ema) All cache_upload hosts upgraded to 4.9 and running fine, with the exception of cp3038 and cp3045 which have been upgraded but failed to reboot properly: T157537. [15:36:54] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178786 (10RobH) [15:37:37] 06Operations, 10ops-codfw: Swap NIC on mira - https://phabricator.wikimedia.org/T162859#3178801 (10RobH) 05Open>03stalled a:05Papaul>03RobH @Papaul: Please don't bother to troubleshoot this, as we've progressed to replacing it outright with T162900. For now, I'll steal this task back until I copy dat... 
[15:37:46] papaul: ^ please dont bother to troubleshoot mira [15:38:04] just get the other new system (was graphite2003) converted back to sata via the task i just made and linked to it [15:38:06] =] [15:38:41] no point in you wasting time troubleshooting an old, out of warranty, due for replacement system [15:38:53] (03CR) 10Marostegui: [C: 031] "Go ahead then, it hasn't solved the timeouts entirely anyways :(" [puppet] - 10https://gerrit.wikimedia.org/r/347996 (owner: 10Jcrespo) [15:39:45] Hey Dereckson, I see you have some SWAT patches lined up, would you mind helping me out quick? [15:41:05] Dereckson: I have foolishly left off recovering my 2fa for wikitech until I need it, so I can't add the patch to the list, but I'm here to babysit it. [15:41:37] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178811 (10RobH) p:05Normal>03High So the switchover from the eqiad deployment host (tin) to the codfw deployment host (mira) was scheduled for approximately April 19th. Ideally naos is... [15:42:26] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:20] 06Operations, 10ops-codfw: Swap NIC on mira - https://phabricator.wikimedia.org/T162859#3178815 (10Papaul) @RobH just note that mira has only 2 NIC's not 4 [15:43:42] robh: ^ [15:44:35] oh, even more reason to get off mira! 
[15:44:59] ahh, r320 [15:45:12] not used to the 3 series we tended to stop doing that for the 15usd it saved us ;] [15:45:26] we tend to do all r430s no r330s these days [15:45:58] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178839 (10RobH) [15:46:23] !log mobrovac@tin Started deploy [citoid/deploy@212800d]: Enable multiple results for T115248 and remove b/c for T114515 [15:46:29] i forgot the physical label step, task fixed [15:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:30] T115248: Strings of digits currently only search for PMIDs; Add multiple results and include OCLC and PMC in the search - https://phabricator.wikimedia.org/T115248 [15:46:30] T114515: Add ability to request base types in citoid, and offer use of both in extension for backwards compatibility until all templateData has been updated, and undo its use in extension, including use of templateData - https://phabricator.wikimedia.org/T114515 [15:47:43] 06Operations, 10ops-codfw: Swap NIC on mira - https://phabricator.wikimedia.org/T162859#3178854 (10RobH) My mistake, it seems the R420s have 4 ports, but the R320s only had 2. Either way, we'll leave mira alone and online until after the replacement system naos is online and ready for use. [15:48:56] * marktraceur retracts request, patch being added [15:49:34] !log mobrovac@tin Finished deploy [citoid/deploy@212800d]: Enable multiple results for T115248 and remove b/c for T114515 (duration: 03m 10s) [15:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:41] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#3178867 (10Dzahn) 05Open>03Resolved Thanks for merging @Andrew and it's gone from https://icinga.wikime... 
[15:55:15] (03CR) 10BBlack: [C: 031] pybal: bind instrumentation TCP port to private addresses [puppet] - 10https://gerrit.wikimedia.org/r/348074 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [15:56:32] (03CR) 10BBlack: [C: 04-1] Release pybal 1.13.6 (031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [15:59:14] (03PS2) 10Dzahn: DNS configuration for wb.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/347141 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T1600). Please do the needful. [16:00:05] thcipriani and Dereckson: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:13] o/ [16:00:46] Hi [16:01:21] marktraceur: sure, at the morning or the evening swat we can do them [16:01:52] Dereckson: Sorry, Matthias added it for me [16:02:01] And I misjudged the time [16:02:20] (03PS2) 10Volans: Switch master DC from eqiad to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto) [16:03:09] marktraceur: so SWAT timetable has been modified when we added the european window, it's now every five hours [16:03:12] (03CR) 10Dzahn: [C: 032] DNS configuration for wb.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/347141 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [16:04:03] (03CR) 10Volans: "Manually rebased to resolve conflicts with https://gerrit.wikimedia.org/r/#/c/347992/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346251 (owner: 10Giuseppe Lavagetto) [16:04:03] marktraceur: in summer, it's 13h, 18h and 23h UTC [16:04:31] (03PS2) 10Dzahn: Add wb.wikimedia.org to ServerAlias for wikimedia-chapter Vhost [puppet] - 
10https://gerrit.wikimedia.org/r/347142 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [16:04:45] looking at the swat patches [16:05:59] (03CR) 10Dzahn: [C: 032] Add wb.wikimedia.org to ServerAlias for wikimedia-chapter Vhost [puppet] - 10https://gerrit.wikimedia.org/r/347142 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [16:06:09] (03PS3) 10Filippo Giunchedi: Scap: set deployment_server correctly [puppet] - 10https://gerrit.wikimedia.org/r/347898 (https://phabricator.wikimedia.org/T162814) (owner: 10Thcipriani) [16:06:32] (03PS2) 10Dzahn: DNS:Add mgmt and production DNS for db20[7-9][0-9] [dns] - 10https://gerrit.wikimedia.org/r/348037 (owner: 10Papaul) [16:08:46] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Scap: set deployment_server correctly [puppet] - 10https://gerrit.wikimedia.org/r/347898 (https://phabricator.wikimedia.org/T162814) (owner: 10Thcipriani) [16:10:19] Dereckson: Fair enough! [16:11:26] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:11:52] godog: after that runs on tin, I'll try a quick sync-file to verify it's still using the correct master_rsync and that there aren't unanticipated effects. [16:13:38] thcipriani: yup, puppet just finished [16:13:43] * thcipriani checks [16:13:49] * elukey waits for a GIT [16:13:53] *GIF [16:15:26] GIF-in-time [16:15:56] !log thcipriani@tin Synchronized README: [[gerrit:347918|scap.cfg change test]] (duration: 00m 44s) [16:16:02] (03CR) 10Dzahn: [C: 032] DNS:Add mgmt and production DNS for db20[7-9][0-9] [dns] - 10https://gerrit.wikimedia.org/r/348037 (owner: 10Papaul) [16:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:52] godog: looks normal to me, thanks for the merge! [16:17:12] thcipriani: np! 
happy to help [16:17:30] elukey: heh I don't know, was a smooth swat and mutante helped too [16:17:51] elukey: https://i.imgur.com/0jeCD73.gif [16:18:26] :D [16:20:32] (03CR) 10Dzahn: [C: 04-2] "yea, but what icinga2 has is not relevant for this change. what is interesting here is that we have "PURGESCRIPT="/usr/local/sbin/purge-na" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [16:21:15] !log mobrovac@tin Started deploy [citoid/deploy@b8c4cb2]: Test deploy for T162814 [16:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:22] T162814: Ensure deployment_server is global - https://phabricator.wikimedia.org/T162814 [16:23:39] !log mobrovac@tin Finished deploy [citoid/deploy@b8c4cb2]: Test deploy for T162814 (duration: 02m 24s) [16:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:52] (03PS3) 10Dzahn: site/prometheus: rm duplicate base::firewall, mv standard to role [puppet] - 10https://gerrit.wikimedia.org/r/347883 [16:24:14] godog: ^ how about that [16:25:45] (03CR) 10Thcipriani: [C: 031] Scap: Remove git_server from scap.cfg [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/347924 (https://phabricator.wikimedia.org/T162814) (owner: 10Thcipriani) [16:25:55] Thanks for the wb deploys mutante [16:26:28] Dereckson: you're welcome [16:27:10] (03CR) 10Filippo Giunchedi: [C: 031] site/prometheus: rm duplicate base::firewall, mv standard to role [puppet] - 10https://gerrit.wikimedia.org/r/347883 (owner: 10Dzahn) [16:27:15] mutante: yeah LGTM, thanks! 
[16:28:06] :) thx [16:28:21] (03PS4) 10Dzahn: site/prometheus: rm duplicate base::firewall, mv standard to role [puppet] - 10https://gerrit.wikimedia.org/r/347883 [16:29:31] (03PS1) 10Jforrester: Set wgUsejQueryThree to false everywhere ahead of further testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348094 [16:30:35] (03CR) 10jerkins-bot: [V: 04-1] Set wgUsejQueryThree to false everywhere ahead of further testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348094 (owner: 10Jforrester) [16:31:39] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3179288 (10Papaul) [16:31:52] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178786 (10Papaul) a:05Papaul>03RobH [16:32:34] 06Operations: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177068 (10MoritzMuehlenhoff) Is there a common pattern of affected distros/kernels/server models? I'd prefer if we first try to pinpoint this further before blacklisting it globally to avoid unforeseen side effects. 
[16:34:04] 06Operations, 10Monitoring, 07LDAP, 13Patch-For-Review: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3173245 (10bd808) Test program: {P5267} [16:34:31] (03PS7) 10Hoo man: Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) [16:34:33] (03PS4) 10Hoo man: Allow running two dumpwikidatattl dumps side by side [puppet] - 10https://gerrit.wikimedia.org/r/347838 (https://phabricator.wikimedia.org/T155103) [16:34:36] (03PS1) 10Hoo man: Wikidata entity dumps: Allow nt RDF dumps [puppet] - 10https://gerrit.wikimedia.org/r/348095 (https://phabricator.wikimedia.org/T155103) [16:34:37] (03PS1) 10Hoo man: Create truthy nt Wikidata entity dump each Monday [puppet] - 10https://gerrit.wikimedia.org/r/348096 (https://phabricator.wikimedia.org/T155103) [16:34:39] (03PS1) 10Hoo man: dumpwikidatardf adjust sanity check for truthy dumps [puppet] - 10https://gerrit.wikimedia.org/r/348097 [16:37:24] 06Operations, 10Monitoring, 07LDAP, 13Patch-For-Review: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3179323 (10bd808) Current result is that paging works, but the total results returned across all pages is still capped at 2048. 
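(Editor's sketch, not part of the channel log.) The behaviour bd808 reports above, where paging works but the total across all pages stays capped, can be modelled with a toy example. This is a simplified simulation under stated assumptions, not the test program from P5267 and not how any particular LDAP server is implemented. The function name `paged_search` and its parameters are hypothetical; the only value taken from the log is the 2048 cap.

```python
from typing import Iterator, List, Optional


def paged_search(entries: List[str], page_size: int,
                 size_limit: Optional[int]) -> Iterator[List[str]]:
    """Toy model of a paged search against a server with a total size limit.

    Assumption: the server counts entries across all pages of one search
    and stops once `size_limit` entries have been returned in total.
    `size_limit=None` models a server where paged searches may exceed the limit.
    """
    returned = 0
    for start in range(0, len(entries), page_size):
        page = entries[start:start + page_size]
        if size_limit is not None:
            remaining = size_limit - returned
            if remaining <= 0:
                return  # cap reached: no further pages are served
            page = page[:remaining]  # last page may be cut short
        returned += len(page)
        yield page


# Hypothetical directory of 5000 entries, fetched 500 at a time.
directory = [f"uid=user{i}" for i in range(5000)]

capped = list(paged_search(directory, page_size=500, size_limit=2048))
uncapped = list(paged_search(directory, page_size=500, size_limit=None))

print(sum(len(p) for p in capped))    # total entries seen with the cap
print(sum(len(p) for p in uncapped))  # total entries seen without it
```

In this model the client cannot tell a capped result from a complete one by page count alone, which is why comparing the total against a known directory size is a useful check.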
[16:42:48] (03CR) 10ArielGlenn: [C: 032] Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man) [16:44:34] (03CR) 10ArielGlenn: [C: 032] Allow running two dumpwikidatattl dumps side by side [puppet] - 10https://gerrit.wikimedia.org/r/347838 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man) [16:49:37] !restored default value of client-output-buffer-limit on rdb1007:6379 - T159850 [16:49:37] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [16:50:13] the full sync has happened and now only partial syncs should be executed, so theoretically we should be good.. I'll chat with Giuseppe and Alex to find a solution [16:50:59] in the task there is all the data to execute the command again if the replication fails [16:52:07] elukey: that didn't !log btw [16:55:21] hahahahah [16:55:28] * elukey facepalm [16:55:37] !log restored default value of client-output-buffer-limit on rdb1007:6379 - T159850 [16:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:46] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [16:55:49] thanks godog [16:56:00] haha no worries elukey [16:56:25] (03PS2) 10Hashar: jenkins: use configuration file for logging [puppet] - 10https://gerrit.wikimedia.org/r/347877 [16:57:20] 06Operations, 07HHVM: Frequent TCP RST on connections between HHVM and Redis - https://phabricator.wikimedia.org/T162354#3179377 (10MoritzMuehlenhoff) Package is built and available on copper for testing (but not yet uploaded to apt.wikimedia.org) [16:59:25] moritzm: we could install it in deployment-prep maybe? 
[17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T1700). Please do the needful. [17:00:41] Amir1, did you want to try to do the deploy today or wait for next week? [17:01:31] no parsoid deploy today. [17:02:12] (03CR) 10Hashar: [C: 031] "I have applied it to jenkinstest.integration.eqiad.wmflabs . When restarting Jenkins the log output looks like https://phabricator.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/347877 (owner: 10Hashar) [17:02:41] I don't think there'll be an ORES deploy. [17:02:53] (03PS1) 10RobH: setting up dns for naos [dns] - 10https://gerrit.wikimedia.org/r/348103 [17:04:06] elukey: I'd install it on mw1261, most of the effects will only be visible with live traffic, but let's do that on Tuesday, not before the long weekend [17:04:10] 06Operations, 10ops-codfw, 06Performance-Team, 15User-fgiunchedi: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3179396 (10RobH) 05Open>03Resolved [17:05:22] (03CR) 10RobH: [C: 032] setting up dns for naos [dns] - 10https://gerrit.wikimedia.org/r/348103 (owner: 10RobH) [17:06:49] moritzm: sure! 
I'll double check RSTs before to make sure that we see some difference [17:07:16] ack [17:08:51] (03PS2) 10Hoo man: Create truthy nt Wikidata entity dump each Monday [puppet] - 10https://gerrit.wikimedia.org/r/348096 (https://phabricator.wikimedia.org/T155103) [17:08:52] (03PS2) 10Hoo man: dumpwikidatardf adjust sanity check for truthy dumps [puppet] - 10https://gerrit.wikimedia.org/r/348097 [17:09:45] (03PS1) 10RobH: setting naos install and site parameters [puppet] - 10https://gerrit.wikimedia.org/r/348104 [17:10:36] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3179418 (10RobH) [17:10:51] (03CR) 10jerkins-bot: [V: 04-1] setting naos install and site parameters [puppet] - 10https://gerrit.wikimedia.org/r/348104 (owner: 10RobH) [17:11:23] (03PS2) 10RobH: setting naos install and site parameters [puppet] - 10https://gerrit.wikimedia.org/r/348104 [17:12:23] (03CR) 10RobH: [C: 032] setting naos install and site parameters [puppet] - 10https://gerrit.wikimedia.org/r/348104 (owner: 10RobH) [17:13:01] (03CR) 10ArielGlenn: [C: 032] Wikidata entity dumps: Allow nt RDF dumps [puppet] - 10https://gerrit.wikimedia.org/r/348095 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man) [17:13:13] (03PS2) 10ArielGlenn: Wikidata entity dumps: Allow nt RDF dumps [puppet] - 10https://gerrit.wikimedia.org/r/348095 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man) [17:13:38] 06Operations, 06Performance-Team, 10Wikidata, 10Wikimedia-Site-requests: Increase $wgExpensiveParserFunctionLimit on nowiki - https://phabricator.wikimedia.org/T160685#3179430 (10jeblad) I'm not going to do further followup on this, and it is only a single place they are stuck on this limit. Will close as... 
[17:13:49] 06Operations, 06Performance-Team, 10Wikidata, 10Wikimedia-Site-requests: Increase $wgExpensiveParserFunctionLimit on nowiki - https://phabricator.wikimedia.org/T160685#3179431 (10jeblad) 05Open>03declined [17:14:34] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3179434 (10elukey) >>! In T125735#3177634, @hashar wrote: > There is a global lock to... [17:15:39] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3179451 (10RobH) [17:16:17] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178786 (10RobH) a:05RobH>03Papaul I just noticed the network port wasn't labeled on the switch with the asset tag, so I need @papaul to determine what it is: [] - @papaul to update thi... [17:16:48] (03PS3) 10ArielGlenn: Create truthy nt Wikidata entity dump each Monday [puppet] - 10https://gerrit.wikimedia.org/r/348096 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man) [17:16:55] (03PS12) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [17:17:08] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3179464 (10RobH) [17:18:23] (03PS1) 10Andrew Bogott: wmfkeystonehooks: Look in the whole tree for the next gid. [puppet] - 10https://gerrit.wikimedia.org/r/348105 [17:19:11] (03CR) 10ArielGlenn: [C: 032] Create truthy nt Wikidata entity dump each Monday [puppet] - 10https://gerrit.wikimedia.org/r/348096 (https://phabricator.wikimedia.org/T155103) (owner: 10Hoo man) [17:19:55] (03PS2) 10Andrew Bogott: wmfkeystonehooks: Look in the whole tree for the next gid. 
[puppet] - 10https://gerrit.wikimedia.org/r/348105 [17:20:44] (03PS3) 10ArielGlenn: dumpwikidatardf adjust sanity check for truthy dumps [puppet] - 10https://gerrit.wikimedia.org/r/348097 (owner: 10Hoo man) [17:23:07] (03CR) 10ArielGlenn: [C: 032] dumpwikidatardf adjust sanity check for truthy dumps [puppet] - 10https://gerrit.wikimedia.org/r/348097 (owner: 10Hoo man) [17:23:52] 06Operations: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3179492 (10Dzahn) >>! In T162850#3179294, @MoritzMuehlenhoff wrote: > Is there a common pattern of affected distros/kernels/server models? | host | ticket | distro | kernel | server model | praseodymium | T123924 | jessie | 4.9.0-0.bpo.... [17:30:19] (03PS13) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [17:31:24] (03CR) 10jerkins-bot: [V: 04-1] Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 (owner: 10Paladox) [17:32:16] (03PS14) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [17:33:11] (03CR) 10jerkins-bot: [V: 04-1] Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 (owner: 10Paladox) [17:34:13] (03PS15) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [17:34:34] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3179539 (10elukey) @aaron any suggestion about the direction to take from your previou... 
[17:35:19] (03CR) 10jerkins-bot: [V: 04-1] Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 (owner: 10Paladox) [17:38:28] (03PS16) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [17:39:34] (03CR) 10jerkins-bot: [V: 04-1] Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 (owner: 10Paladox) [17:40:52] (03PS17) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [17:41:11] could stop spamming us with that? [17:41:38] feel free to work on it on your vagrant box or labs VM or whatever, but please stop spamming us with it [17:41:41] paladox: ^ [17:42:00] It's for labs. [17:42:25] I don't really care [17:42:34] jenkins isn't your personal code reviewer, stop doing that [17:45:59] (03CR) 10BryanDavis: wmfkeystonehooks: Look in the whole tree for the next gid. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348105 (owner: 10Andrew Bogott) [17:48:06] (03PS7) 10Madhuvishy: tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [17:49:04] (03CR) 10jerkins-bot: [V: 04-1] tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [17:49:40] jouncebot: next [17:49:40] In 0 hour(s) and 10 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T1800) [17:49:59] Think I'll start merging my patches for that then [17:52:40] (03PS5) 10Dzahn: site/prometheus: rm duplicate base::firewall, mv standard to role [puppet] - 10https://gerrit.wikimedia.org/r/347883 [17:52:50] (03PS8) 10Madhuvishy: tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [17:54:02] (03CR) 10jerkins-bot: [V: 04-1] tools: job to copytruncate logs 
in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [17:57:32] (03PS9) 10Madhuvishy: tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [17:58:02] 06Operations, 10ops-codfw, 10hardware-requests: decommision nembus - https://phabricator.wikimedia.org/T162928#3179655 (10RobH) [17:59:58] (03CR) 10Faidon Liambotis: [C: 04-2] "Sorry, not gonna happen." [puppet] - 10https://gerrit.wikimedia.org/r/347640 (owner: 10Paladox) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T1800). Please do the needful. [18:00:04] Dereckson, marktraceur, Reedy, and James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:14] I don't mind swatting... [18:00:16] * marktraceur waves [18:00:17] Unless someone else is [18:00:25] Reedy: Go for it buddy, I'm here for moral support [18:00:42] 06Operations, 10ops-codfw, 10hardware-requests: decommision nembus - https://phabricator.wikimedia.org/T162928#3179685 (10RobH) p:05Triage>03Low [18:00:46] First of my patches is waiting for jerkins [18:01:18] i dont think that was a typo, if it thinks you dislike it it'll just go slower! [18:01:18] Reedy: Can you superintend mine? No-op setting the jQuery 3 config to the safe (false) default, have to run. :-( [18:01:30] Yeah, shouldn't be problems [18:01:53] Cool. [18:03:38] Tut tut jouncebot [18:03:39] ffs [18:03:43] Tut tut James_F [18:03:47] Not valid PHP :P [18:03:53] Wuh oh. 
[18:04:08] (03PS2) 10Reedy: Set wgUsejQueryThree to false everywhere ahead of further testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348094 (owner: 10Jforrester) [18:04:10] Missed a , [18:04:13] (03PS3) 10Andrew Bogott: wmfkeystonehooks: Look in the whole tree for the next gid. [puppet] - 10https://gerrit.wikimedia.org/r/348105 [18:04:14] (03CR) 10Dzahn: [C: 04-1] "we'll make this conditional and only blacklist it if facter 'F:productname = "PowerEdge R320"'" [puppet] - 10https://gerrit.wikimedia.org/r/348016 (owner: 10Dzahn) [18:04:42] looks like everyone elses are config.. So I can just keep merging mine in the background [18:04:55] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3179712 (10Papaul) a:05Papaul>03RobH ge-5/0/15 [18:05:01] (03PS2) 10Reedy: Run 3d2png with xfvb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347893 (https://phabricator.wikimedia.org/T159717) (owner: 10Matthias Mullie) [18:05:04] (03CR) 10Reedy: [C: 032] Run 3d2png with xfvb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347893 (https://phabricator.wikimedia.org/T159717) (owner: 10Matthias Mullie) [18:05:07] Hoorah [18:05:18] Reedy: Bah. [18:05:25] Fixed it for you :P [18:05:36] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [18:05:41] (03CR) 10jerkins-bot: [V: 04-1] wmfkeystonehooks: Look in the whole tree for the next gid. 
[puppet] - 10https://gerrit.wikimedia.org/r/348105 (owner: 10Andrew Bogott) [18:06:48] (03Merged) 10jenkins-bot: Run 3d2png with xfvb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347893 (https://phabricator.wikimedia.org/T159717) (owner: 10Matthias Mullie) [18:07:17] (03CR) 10jenkins-bot: Run 3d2png with xfvb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347893 (https://phabricator.wikimedia.org/T159717) (owner: 10Matthias Mullie) [18:07:45] (03PS3) 10Reedy: Set wgUsejQueryThree to false everywhere ahead of further testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348094 (owner: 10Jforrester) [18:07:49] (03CR) 10Reedy: [C: 032] Set wgUsejQueryThree to false everywhere ahead of further testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348094 (owner: 10Jforrester) [18:09:00] !log reedy@tin Synchronized wmf-config/CommonSettings-labs.php: Run 3d2png with xfvb-run on beta (duration: 00m 43s) [18:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:52] Hmmm [18:10:57] (03Merged) 10jenkins-bot: Set wgUsejQueryThree to false everywhere ahead of further testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348094 (owner: 10Jforrester) [18:11:06] (03CR) 10BryanDavis: wmfkeystonehooks: Look in the whole tree for the next gid. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348105 (owner: 10Andrew Bogott) [18:11:07] (03CR) 10jenkins-bot: Set wgUsejQueryThree to false everywhere ahead of further testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348094 (owner: 10Jforrester) [18:11:17] Reedy: Looks like the change didn't fix anything, so the two 3d files on beta are just as broken as they were before. But thanks for trying :) [18:11:27] marktraceur: Has it actually deployed to beta yet? 
[18:11:31] It takes a little while sometimes [18:11:32] Yeah, the jobs ran [18:11:37] https://upload.beta.wmflabs.org/wikipedia/commons/thumb/e/e3/F_%281%29.stl/800px-F_%281%29.stl.png shows the new command [18:12:27] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Set wgUsejQueryThree to false everywhere ahead of further testing (duration: 00m 43s) [18:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:40] (03PS2) 10Reedy: Document Education Program task reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347989 (owner: 10Dereckson) [18:13:44] (03CR) 10Reedy: [C: 032] Document Education Program task reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347989 (owner: 10Dereckson) [18:14:26] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3179793 (10RobH) [18:14:46] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:15:01] (03Merged) 10jenkins-bot: Document Education Program task reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347989 (owner: 10Dereckson) [18:15:05] (03PS4) 10Andrew Bogott: wmfkeystonehooks: Look in the whole tree for the next gid. [puppet] - 10https://gerrit.wikimedia.org/r/348105 [18:15:22] Dereckson: About? 
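(Editor's sketch, not part of the channel log.) The "Synchronized" entries relayed above follow a regular shape: timestamp, `!log`, user@host, the synced path, a free-text message, and a duration. A minimal parser for that one shape might look like the following. The pattern is inferred only from the lines in this log, so it is an assumption rather than a documented format, and the names `SYNC_RE` and `parse_sync_entry` are hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Pattern inferred from the "Synchronized" lines in this log (an assumption,
# not an official format). It does not cover "Started deploy" or other entries.
SYNC_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] !log "
    r"(?P<user>[^@\s]+)@(?P<host>\S+) Synchronized "
    r"(?P<target>\S+): (?P<message>.*) "
    r"\(duration: (?P<min>\d+)m (?P<sec>\d+)s\)$"
)


def parse_sync_entry(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return the fields of one sync log entry, or None if it does not match."""
    m = SYNC_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": m.group("time"),
        "user": m.group("user"),
        "host": m.group("host"),
        "target": m.group("target"),
        "message": m.group("message"),
        # Duration is normalised to whole seconds.
        "duration_s": int(m.group("min")) * 60 + int(m.group("sec")),
    }


entry = parse_sync_entry(
    "[18:12:27] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: "
    "Set wgUsejQueryThree to false everywhere ahead of further testing "
    "(duration: 00m 43s)"
)
print(entry)
```

Returning None for non-matching lines lets the same function be run over the whole log, skipping chat messages and other bot output.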
[18:15:26] yes I'm here [18:15:28] Reedy: Well, since that changed nothing, I'm going to head off now and get home for my next meeting, thanks for the push [18:16:20] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Document EducationProgram config (duration: 00m 43s) [18:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:50] !log T161243: Truncating parsoid tables (default storage group) [18:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:59] T161243: Truncate wikimedia and wikidata storage groups - https://phabricator.wikimedia.org/T161243 [18:17:09] (03CR) 10jenkins-bot: Document Education Program task reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347989 (owner: 10Dereckson) [18:19:24] (03Abandoned) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 (owner: 10Paladox) [18:19:28] (03CR) 10Dzahn: [C: 032] site/prometheus: rm duplicate base::firewall, mv standard to role [puppet] - 10https://gerrit.wikimedia.org/r/347883 (owner: 10Dzahn) [18:19:46] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:19:48] !log reedy@tin Synchronized php-1.29.0-wmf.19/extensions/Wikidata: Stop some logspam for deprecated hook usage (duration: 02m 14s) [18:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:01] (03PS3) 10Dzahn: jenkins: use configuration file for logging [puppet] - 10https://gerrit.wikimedia.org/r/347877 (owner: 10Hashar) [18:21:23] (03CR) 10Andrew Bogott: [C: 032] wmfkeystonehooks: Look in the whole tree for the next gid. 
[puppet] - 10https://gerrit.wikimedia.org/r/348105 (owner: 10Andrew Bogott) [18:21:24] !log reedy@tin Synchronized php-1.29.0-wmf.20/extensions/LiquidThreads: Stop some logspam for deprecated hooks (duration: 00m 45s) [18:21:27] (03PS5) 10Andrew Bogott: wmfkeystonehooks: Look in the whole tree for the next gid. [puppet] - 10https://gerrit.wikimedia.org/r/348105 [18:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:00] !log reedy@tin Synchronized php-1.29.0-wmf.20/extensions/WikimediaEvents: Stop some logspam for deprecated hooks (duration: 00m 43s) [18:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:29] (03CR) 10Dzahn: [C: 032] jenkins: use configuration file for logging [puppet] - 10https://gerrit.wikimedia.org/r/347877 (owner: 10Hashar) [18:23:34] (03PS4) 10Dzahn: jenkins: use configuration file for logging [puppet] - 10https://gerrit.wikimedia.org/r/347877 (owner: 10Hashar) [18:25:18] !log reedy@tin Synchronized php-1.29.0-wmf.20/extensions/Wikidata: Stop some logspam for deprecated hooks (duration: 02m 06s) [18:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:26] (03PS2) 10Reedy: Enable NewUserMessage on tr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345982 (https://phabricator.wikimedia.org/T161962) (owner: 10Dereckson) [18:26:30] (03CR) 10Reedy: [C: 032] Enable NewUserMessage on tr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345982 (https://phabricator.wikimedia.org/T161962) (owner: 10Dereckson) [18:26:46] @seen hashar [18:26:46] mutante: Last time I saw hashar they were quitting the network with reason: Quit: Textual IRC Client: www.textualapp.com N/A at 4/13/2017 5:18:46 PM (1h7m59s ago) [18:27:50] (03PS1) 10Chad: Group2 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348116 [18:27:54] Reedy: tables already created for Education Program in it.Wikiversity by the way [18:28:00] perfect 
:) [18:29:24] (03Merged) 10jenkins-bot: Enable NewUserMessage on tr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345982 (https://phabricator.wikimedia.org/T161962) (owner: 10Dereckson) [18:29:33] (03CR) 10jenkins-bot: Enable NewUserMessage on tr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345982 (https://phabricator.wikimedia.org/T161962) (owner: 10Dereckson) [18:29:53] !log restarting jenkins service to apply logging change gerrit:347877. it was already tested on jenkinstest.integration.eqiad.wmflabs [18:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:59] !log T161243: Truncating parsoid tables (wikimedia storage group) [18:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:06] T161243: Truncate wikimedia and wikidata storage groups - https://phabricator.wikimedia.org/T161243 [18:30:36] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable NewUserMessage on tr.wikiquote T161962 (duration: 00m 43s) [18:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:43] T161962: Enable NewUserMessage on trwikiquote - https://phabricator.wikimedia.org/T161962 [18:30:53] (03PS2) 10Reedy: Enable AbuseFilter blocks on tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345983 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [18:30:57] (03CR) 10Reedy: [C: 032] Enable AbuseFilter blocks on tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345983 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [18:31:22] 06Operations, 10Deployment-Systems, 10RESTBase, 06Services, 13Patch-For-Review: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? - https://phabricator.wikimedia.org/T107532#3179892 (10Pchelolo) 05Open>03Resolved a:03Pchelolo After moving to scap3 this ticket is obsolete,... 
[18:31:56] Reedy: 456 editCheckboxes() expects exactly 3 parameters, 2 given in /srv/mediawiki/php-1.29.0-wmf.20/extensions/LiquidThreads/classes/Hooks.php on line 317 [18:32:13] bleugh [18:32:21] (03Merged) 10jenkins-bot: Enable AbuseFilter blocks on tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345983 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [18:32:35] (03CR) 10jenkins-bot: Enable AbuseFilter blocks on tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345983 (https://phabricator.wikimedia.org/T161960) (owner: 10Dereckson) [18:32:36] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:32:41] https://gerrit.wikimedia.org/r/#/c/348089/1/classes/Hooks.php [18:32:48] Dereckson: Thanks. Looks like doc removed, parameter nor [18:34:05] logfiles that use AM and PM for timestamps.. duh [18:35:21] !log reedy@tin Synchronized wmf-config/abusefilter.php: Enable AbuseFilter blocks on tr.wikipedia T161960 (duration: 00m 43s) [18:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:28] T161960: Enable the blocking feature of AbuseFilter on trwiki - https://phabricator.wikimedia.org/T161960 [18:35:37] (03PS2) 10Reedy: Enable Education Program on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347987 (https://phabricator.wikimedia.org/T162692) (owner: 10Dereckson) [18:35:54] (03CR) 10Reedy: [C: 032] Enable Education Program on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347987 (https://phabricator.wikimedia.org/T162692) (owner: 10Dereckson) [18:36:47] (03CR) 10Dzahn: "restarted jenkins service on contint100, no issues, the logfile now does not use AM/PM anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/347877 (owner: 10Hashar) [18:38:16] !log reedy@tin Synchronized php-1.29.0-wmf.20/extensions/LiquidThreads: Remove extra parameter from hook (duration: 00m 45s) [18:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:56] (03Merged) 10jenkins-bot: Enable Education Program on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347987 (https://phabricator.wikimedia.org/T162692) (owner: 10Dereckson) [18:38:57] Reedy: James_F fix in master seems to address the LQT concern too [18:39:05] (03CR) 10jenkins-bot: Enable Education Program on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347987 (https://phabricator.wikimedia.org/T162692) (owner: 10Dereckson) [18:39:07] (03PS1) 10Krinkle: Set wgUsejQueryThree to false in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 [18:39:08] (Switch from deprecated EditPageBeforeEditChecks to EditPageGetCheckboxesDefinition) [18:39:35] (03CR) 10Krinkle: "Also removes 'wgIncludejQueryMigrate' which on longer exists." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (owner: 10Krinkle) [18:40:12] (03CR) 10Reedy: [C: 04-1] "Might want to rebase ontop of https://gerrit.wikimedia.org/r/#/c/348094/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (owner: 10Krinkle) [18:40:41] (03CR) 10Krinkle: "Oh, didn't know that already happened." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (owner: 10Krinkle) [18:41:40] (03PS2) 10Krinkle: Enable wgUsejQueryThree in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 [18:42:54] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Enable Education Program on it.wikiversity T162692 (duration: 00m 43s) [18:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:03] T162692: Install Extension:Education_Program on it.wikiversity - https://phabricator.wikimedia.org/T162692 [18:44:50] James_F: thx [18:44:51] (03PS2) 10Reedy: Clean Wikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337520 (https://phabricator.wikimedia.org/T46320) (owner: 10Dereckson) [18:45:00] Dereckson: I guess we should check this one on mwdebug :P [18:45:07] (03CR) 10Reedy: [C: 032] Clean Wikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337520 (https://phabricator.wikimedia.org/T46320) (owner: 10Dereckson) [18:45:20] (03CR) 10Krinkle: [C: 031] Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) [18:45:22] * Dereckson nods [18:46:31] brb [18:46:47] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Labvirt1001 has insanely slow IO - https://phabricator.wikimedia.org/T159835#3180147 (10Andrew) 05Open>03Resolved Repooled, seems fine. 
[18:46:57] (03Merged) 10jenkins-bot: Clean Wikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337520 (https://phabricator.wikimedia.org/T46320) (owner: 10Dereckson) [18:47:10] (03CR) 10jenkins-bot: Clean Wikisource namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337520 (https://phabricator.wikimedia.org/T46320) (owner: 10Dereckson) [18:48:51] Dereckson: It's on mwdebug1001 [18:49:46] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:18] LGTM on https://en.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces%7Cnamespacealiases [18:50:36] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 210 bytes in 0.354 second response time [18:50:56] (03PS1) 10Urbanecm: Add logos for wbwikimedia to the filesystem [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348122 (https://phabricator.wikimedia.org/T162510) [18:52:03] Reedy: I'm checking with https://....wikisource.org/wiki/Special:Random/Index and /Page [18:52:08] heh [18:52:17] (good for the first few) [18:52:19] (03PS2) 10Urbanecm: Add logos for wbwikimedia to the filesystem [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348122 (https://phabricator.wikimedia.org/T162510) [18:53:29] Dereckson: As you may saw in the patches above I've prepared a patch adding the logos for T162510 . If the wikis can be created together it'll be great! [18:53:29] T162510: Create fishbowl wiki including a blog space for West Bengal Wikimedians User Group - https://phabricator.wikimedia.org/T162510 [18:54:16] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:54:17] Urbanecm: yes it's possible, Daniel handled dns and apache part [18:54:31] Urbanecm: add it to the deployments calendar with pa.wikisource [18:54:44] Dereckson: to your window? [18:54:46] yes [18:55:02] Okay, I'll add it there. 
[18:55:55] Reedy: good for the dozen I checked [18:56:20] sweet [18:57:00] * James_F is back; thanks Reedy, you're welcome Krinkle. [18:57:09] Dereckson: Just a question. This T162513 isn't needed? Did I create it without any actual need? [18:57:09] T162513: Prepare and check storage layer for wbwikimedia - https://phabricator.wikimedia.org/T162513 [18:57:30] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: Clean Wikisource namespaces T46320 (duration: 00m 43s) [18:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:36] T46320: Clean Page and Index namespaces configuration for Wikisource - https://phabricator.wikimedia.org/T46320 [18:57:47] Urbanecm: oh oh [18:57:57] Urbanecm: so, no, we can't create it this evening [18:58:20] new workflow adds it's labs replication first, wiki creation afterwards [18:58:47] I didn't found any public wikimedia wiki replicated to labs... [18:59:17] ah ? [19:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T1900). Please do the needful. [19:00:15] Urbanecm: er yes [19:00:16] dereckson@tools-bastion-02:~$ sql bewikimedia [19:00:17] I wanted to say I didn't found any *wikimedia wiki replicated to labs. But I accept it can't be created, I'll note it in the task. [19:00:18] MariaDB [bewikimedia_p]> [19:00:56] I checked a table, we've data [19:01:29] it's not for .beta.wmflabs actually, it's to be available in labs cluster as a replication [19:01:37] Okay, it is there... Never mind :) [19:01:44] So tools, bots, services like quarry can query it. [19:02:19] https://phabricator.wikimedia.org/T162102 <- same here I can't create it [19:03:01] a pity, contributors submitted namespaces localisation swiftly when I noticed this morning there was missing [19:03:52] (03CR) 10Jforrester: [C: 04-1] "Premature, I think. 
Let's do a wikitech-l post and get unit tests passing across all WMF-deployed repos first?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348120 (owner: 10Krinkle) [19:12:21] (03CR) 10Chad: [C: 032] Group2 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348116 (owner: 10Chad) [19:14:50] (03Merged) 10jenkins-bot: Group2 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348116 (owner: 10Chad) [19:15:04] (03CR) 10jenkins-bot: Group2 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348116 (owner: 10Chad) [19:16:50] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.20 [19:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:04] (03CR) 10Ppchelko: [C: 04-1] "Since the RESTBase config was moved to the deploy repo this patch is obsolete. I will create a new one against the RESTBase deploy repo. F" [puppet] - 10https://gerrit.wikimedia.org/r/345877 (https://phabricator.wikimedia.org/T161284) (owner: 10GWicke) [19:20:54] 06Operations, 10Page-Previews, 06Performance-Team, 06Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3180529 (10ovasileva) [19:22:16] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:22:32] 06Operations, 10hardware-requests: spare pool allocation of WMF6406 to replace mira - https://phabricator.wikimedia.org/T162897#3178728 (10RobH) a:05RobH>03faidon I just made this task to track the actual approval of the host. @faidon approved this host via IRC earlier today, just assigning to him at lo... 
[19:25:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This is ok only if we remember to bind instrumentation on all pybal instances to the IP the hostname resolves to, or this will break poole" [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [19:25:55] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3180559 (10RobH) a:05mark>03faidon I'm assigning this to @faidon since @mark is away at present. @faidon: Basically I... [19:41:36] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:42:00] is that epxected ^^? [19:42:36] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [19:54:17] (03Abandoned) 10GWicke: WIP: Add cache-control option that allows for short term client caching [puppet] - 10https://gerrit.wikimedia.org/r/345877 (https://phabricator.wikimedia.org/T161284) (owner: 10GWicke) [19:56:10] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3180700 (10hashar) Nice test @elukey, looks like there is at least a TCP connection es... 
[20:07:58] !log T161243: Clearing all snapshots [20:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:05] T161243: Truncate wikimedia and wikidata storage groups - https://phabricator.wikimedia.org/T161243 [20:10:39] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3180739 (10RobH) Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadsheet with the CPU/memory/disk/nic r... [20:11:13] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#2919499 (10RobH) a:03RobH [20:11:43] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3141989 (10RobH) a:03RobH Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadsheet w... [20:12:14] 06Operations, 06Labs, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3142263 (10RobH) Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadsheet with the... [20:12:37] 06Operations, 10hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3141925 (10RobH) Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadshe... 
[20:12:51] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#2921133 (10RobH) Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadsheet with... [20:13:30] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestnet2003 [region 2] - https://phabricator.wikimedia.org/T161764#3142232 (10RobH) Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadsheet wi... [20:14:08] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3029754 (10RobH) Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadsheet with the... [20:14:36] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestvirt2003 [region 2] - https://phabricator.wikimedia.org/T161765#3142249 (10RobH) Ok, this one is slightly confusing when taken in with all the other labs requests, so I had to put all of them in to a single spreadsheet w... [20:14:56] Ok, that is 9 of 11 of those requests sorted and dispatched [20:15:12] the foodening is upon us. 
[20:17:31] (03PS1) 10Andrew Bogott: wmfkeystonehooks: Work around a keystone bug with role removal [puppet] - 10https://gerrit.wikimedia.org/r/348135 (https://phabricator.wikimedia.org/T162615) [20:36:06] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2480.20 Read Requests/Sec=1073.80 Write Requests/Sec=1.70 KBytes Read/Sec=39471.20 KBytes_Written/Sec=424.00 [20:44:06] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=16.90 Read Requests/Sec=28.30 Write Requests/Sec=81.50 KBytes Read/Sec=154.00 KBytes_Written/Sec=976.40 [20:45:22] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] swith ports configuration - https://phabricator.wikimedia.org/T162944#3180944 (10Papaul) [20:45:56] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3180963 (10Luke081515) [20:47:03] moritzm, o/ [20:47:15] I was hoping to talk about when you'd like to do maintenance on the postgres server in labs [20:47:22] You pinged me about Wikilabels. [20:47:32] I want to pick a maintenance window so I can announce it to my users :D [20:48:54] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] swith ports configuration - https://phabricator.wikimedia.org/T162944#3180980 (10Papaul) [20:49:52] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3180944 (10Papaul) [20:51:36] (03PS1) 10ArielGlenn: process zero-length text entries as regular sql inserts [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348138 [20:54:56] (03CR) 10jerkins-bot: [V: 04-1] process zero-length text entries as regular sql inserts [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348138 (owner: 10ArielGlenn) [20:59:00] halfak: just propose a time/date, ideally SF morning / EU evening [20:59:30] 1300 UTC on 4/21? 
[20:59:42] Well, actually 1400 UTC would be a bit easier for me [20:59:57] Boo to SF. let them sleep :) [21:00:04] Dereckson: Respected human, time to deploy Wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T2100). Please do the needful. [21:00:06] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1221.50 Read Requests/Sec=401.30 Write Requests/Sec=3.40 KBytes Read/Sec=39600.00 KBytes_Written/Sec=135.20 [21:00:21] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3181028 (10Papaul) [21:00:23] moritzm, ^ [21:01:06] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=15.80 Read Requests/Sec=0.60 Write Requests/Sec=2.60 KBytes Read/Sec=34.40 KBytes_Written/Sec=45.60 [21:01:23] halfak: 14:00 UTC on 4/21 sounds good to me [21:01:40] Cool I'll start sending announcements. Thanks :) [21:01:52] (Also good to know to expect you to be around in EU time :D) [21:02:29] EU time is all the timezones covered by the European Union :-) [21:03:48] Yeah, but for all of those timezones, this is kind of late [21:04:06] Aren't there, like, 3 timezones. [21:04:37] I feel like if you said US-time, I'd assume you were talking about something between UTC-5 and UTC-8. [21:05:25] if should hardly be notable in any time zone, maybe 10 seconds tops :-) [21:06:24] moritzm, the maintenance? I've been burnt. 
I'm going to be prepared for a good 2 hour oops and happy about a 10 second bump :) [21:07:25] ok :-) [21:17:14] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3181168 (10Dzahn) [21:21:59] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops, 13Patch-For-Review: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3181205 (10Jgreen) [21:34:49] (03PS1) 10Dereckson: Initial configuration for pa.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348148 (https://phabricator.wikimedia.org/T149522) [21:36:31] Urbanecm: I've checked with Reedy, it seems to prepare the labs replication MUST be done before for *private* wikis, to avoid a public replication, so yes, you can add wb.wikimedia [21:36:52] that should be fine normally [21:36:53] 06Operations, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3181286 (10cwdent) [21:37:51] (we've https://wikitech.wikimedia.org/wiki/Add_a_wiki a big warning IMPORTANT: For Private Wikis) [21:43:45] (03PS1) 10ArielGlenn: last page range for page content job would sometimes have too many revs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348153 [21:43:55] (03CR) 10jerkins-bot: [V: 04-1] last page range for page content job would sometimes have too many revs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348153 (owner: 10ArielGlenn) [21:44:07] yeah wrong branch [21:44:23] (03Abandoned) 10ArielGlenn: last page range for page content job would sometimes have too many revs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/348153 (owner: 10ArielGlenn) [21:46:58] (03PS2) 10ArielGlenn: last page range for page content job would sometimes have too many revs [dumps] - 10https://gerrit.wikimedia.org/r/347627 [21:47:18] and that's it for the night, good night all [21:47:40] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6152/" [puppet] - 
10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [21:48:46] !log demon@tin Started scap: pruned cdb files from wmf.18 [21:48:53] good night apergos [21:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:02] 06Operations, 06Labs: rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955#3181356 (10chasemp) [21:53:34] (03PS1) 10Mobrovac: RESTBase: Add the CXServer service URI [puppet] - 10https://gerrit.wikimedia.org/r/348154 (https://phabricator.wikimedia.org/T107914) [21:54:16] 'night [21:54:29] 06Operations, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3181286 (10Dzahn) barium appears in an exim config template: modules/role/templates/exim/exim4.conf.mx.erb ``` 206 # Send donate.wikimedia.org mail to Fundraising CiviCRM server 207 d... [21:55:11] cwd: ^ did you know about barium in that template? [21:55:33] mutante: i did not, thanks for the heads up [21:55:59] i know nothing about prod puppet :-\ [21:56:14] alright, so see that comment in there [21:56:18] 206 # Send donate.wikimedia.org mail to Fundraising CiviCRM server [21:56:28] the new server is still civicrm, right [21:56:41] !log demon@tin Finished scap: pruned cdb files from wmf.18 (duration: 07m 55s) [21:56:45] yep, it's identical just jessie, dns name is civi1001 [21:56:47] and you also still want it to receive donate@ mail [21:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:17] mutante: afaik it should stay that way [21:57:37] we flipped the IP to the new server so the mail should make it through the firewall [21:57:43] cwd: do you want me to upload that change to gerrit and then maybe you can get a review for it? 
[21:57:56] mutante: that would be great, thanks [21:58:11] ok [21:58:20] i'll email jeff about it, he'll see it before tomorrow [21:58:27] i'm guessing he has +2 [21:58:45] yea, that's basically what i meant, and he does [21:58:50] good [21:59:18] (03PS2) 10Dereckson: Initial configuration for pa.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348148 (https://phabricator.wikimedia.org/T149522) [21:59:32] mutante: thanks much for the alert [21:59:57] cwd: you're welcome [22:00:27] 06Operations, 06Labs: rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955#3181425 (10chasemp) If this was successful I wonder if we could easily bake in http://wiki.gridengine.info/wiki/index.php/RQS_Common_Uses#Max_user_jobs_in_a_particular_queue [22:00:48] (03CR) 10Dereckson: "PS2: regenerate logos from SVG with the correct font" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348148 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [22:00:53] (03CR) 10Dereckson: [C: 032] Initial configuration for pa.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348148 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [22:01:13] (03PS10) 10Rush: tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) [22:02:32] (03PS1) 10Dzahn: exim/fundraising: replace barium with civi1001 [puppet] - 10https://gerrit.wikimedia.org/r/348158 (https://phabricator.wikimedia.org/T162952) [22:04:42] (03PS2) 10Dzahn: exim/fundraising: barium -> civi1001, donate mails to civicrm [puppet] - 10https://gerrit.wikimedia.org/r/348158 (https://phabricator.wikimedia.org/T162952) [22:06:20] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/6153/" [puppet] - 10https://gerrit.wikimedia.org/r/348154 (https://phabricator.wikimedia.org/T107914) (owner: 10Mobrovac) [22:06:39] (03CR) 10Dzahn: "host civi1001.frack.eqiad.wmnet" [puppet] - 
10https://gerrit.wikimedia.org/r/348158 (https://phabricator.wikimedia.org/T162952) (owner: 10Dzahn) [22:07:23] (03Merged) 10jenkins-bot: Initial configuration for pa.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348148 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [22:08:32] pa.wikisource live on mwdebug1002 and terbium, we can proceed [22:10:35] db ok [22:10:54] (03CR) 10jenkins-bot: Initial configuration for pa.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348148 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [22:12:02] !log dereckson@tin Synchronized dblists: pa.wikisource creation (T149522) (duration: 00m 41s) [22:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:09] T149522: Create Wikisource Eastern Punjabi - https://phabricator.wikimedia.org/T149522 [22:12:21] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided) [22:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:23] !log dereckson@tin Synchronized static/images/project-logos/: Logos for pa.wikisource (T149522) (duration: 00m 41s) [22:14:25] (03CR) 10Rush: tools: job to copytruncate logs in place (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/326153 (https://phabricator.wikimedia.org/T152235) (owner: 10Rush) [22:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:48] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for pa.wikisource (T149522) (duration: 00m 41s) [22:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:47] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:17:47] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:18:19] (03Abandoned) 10Jdlrobson: Reflect change in purpose of RelatedArticlesFooterBlacklistedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345758 (https://phabricator.wikimedia.org/T160076) (owner: 10Jdlrobson) [22:18:36] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:18:36] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [22:20:10] (03PS3) 10Jdlrobson: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) [22:20:12] (03PS1) 10Jdlrobson: Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) [22:24:33] Reedy: I've a pa.wikisource.org fully sync'ed, but MWMultiVersion still returns wikisource.org (sourceswiki) instead of pawikisource [22:25:47] hmmm perhaps purge the cache for the homepage [22:25:55] reedy@tin:/srv/mediawiki-staging$ mwscript eval.php pawikisource [22:25:55] no version entry for `pawikisource`. [22:26:40] Dereckson: pawikisource isn't in wikiversions.json on tin [22:26:59] Nor is it in wikiversions.php [22:27:15] (at least on tin) [22:27:58] Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded. 
[22:28:08] oh, understood, I must have scapped it on mwdebug1002 and terbium before git pull on tin [22:28:18] sounds like it :) [22:28:27] so I can resync the whole [22:29:32] Aye [22:30:17] !log dereckson@tin Synchronized dblists: pa.wikisource creation (take two) (duration: 00m 41s) [22:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:53] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: pa.wikisource creation (take two) [22:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:40] !log dereckson@tin Synchronized w/static/images/project-logos/: pa.wikisource creation (take two) (duration: 00m 40s) [22:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:01] Ah works better :) [22:32:22] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: pa.wikisource creation (take two) (duration: 00m 41s) [22:32:24] heh [22:32:26] Dereckson: just a gentle reminder, I forgot to add the wbwikimedia to the calendar. Would you create it? [22:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:27] Urbanecm: still some small things to do for pa.wikisource first [22:34:03] Urbanecm: but yes, you can add it to the calendar [22:34:30] Dereckson: okay, I'll add it there. 
[22:36:23] Added [22:36:23] (03PS3) 10Dereckson: Add logos for wbwikimedia to the filesystem [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348122 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:37:23] (03CR) 10Dereckson: [C: 032] Add logos for wbwikimedia to the filesystem [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348122 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:38:14] Urbanecm: ah yes, there were still a concern about bureaucrats desysop [22:38:42] (03Merged) 10jenkins-bot: Add logos for wbwikimedia to the filesystem [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348122 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:38:58] Dereckson: some other *wikimedia wikis allows this so I have no objection but you decide as you deploy. [22:39:56] They don't seem concerned or bothered on the task someone could come after them, be elected president and want to cleanup the old team. [22:40:03] So whatever... 
[22:40:24] (03CR) 10jenkins-bot: Add logos for wbwikimedia to the filesystem [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348122 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:41:32] Urbanecm: should also be included in securepollglobal.dblist [22:46:08] (03PS2) 10Dereckson: Initial configuration for wbwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:46:34] (03Draft1) 10Paladox: Phabricator: Update nrpe command for checking if phd is running [puppet] - 10https://gerrit.wikimedia.org/r/348165 [22:46:36] (03PS2) 10Paladox: Phabricator: Update nrpe command for checking if phd is running [puppet] - 10https://gerrit.wikimedia.org/r/348165 [22:47:11] !log dereckson@tin Synchronized static/images/project-logos/: Logos for wb.wikimedia (T162510) (duration: 00m 41s) [22:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:19] T162510: Create fishbowl wiki including a blog space for West Bengal Wikimedians User Group - https://phabricator.wikimedia.org/T162510 [22:47:28] (03CR) 10Dereckson: "PS2: +securepollglobal.dblist and rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:48:10] (03CR) 10Dereckson: [C: 032] Initial configuration for wbwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:52:54] (03PS3) 10Dereckson: Initial configuration for wbwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:53:28] (03CR) 10Dereckson: [C: 032] Initial configuration for wbwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:53:37] 06Operations, 10puppet-compiler: hosts with puppet compiler 
failures on every run - https://phabricator.wikimedia.org/T162949#3181656 (10Dzahn) [22:54:19] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3181168 (10Dzahn) [22:54:48] (03Merged) 10jenkins-bot: Initial configuration for wbwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:54:59] (03CR) 10jenkins-bot: Initial configuration for wbwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347214 (https://phabricator.wikimedia.org/T162510) (owner: 10Urbanecm) [22:57:02] (03CR) 10Niharika29: [C: 032] Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) [22:57:51] Hey. I'm deploying today (under Roan's watch). [22:58:02] Niharika: okay, but not right now [22:58:15] Dereckson: Yep. Waiting. [22:58:21] Sorry for that^^^ [22:58:25] Is the wiki creation running over time? [22:58:28] new wikis! [22:58:35] !log dereckson@tin Synchronized dblists: Create wb.wikimedia.org (T162510) (duration: 00m 41s) [22:58:37] I'm still deploying wb.wikimedia (3-10 minutes remaining if all goes well) [22:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:41] T162510: Create fishbowl wiki including a blog space for West Bengal Wikimedians User Group - https://phabricator.wikimedia.org/T162510 [22:59:00] RoanKattouw: yes, slightly [22:59:02] OK [22:59:08] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: Create wb.wikimedia.org (T162510) [22:59:09] :) [22:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:16] party got started early! 
:) [22:59:23] jouncebot: now [22:59:23] For the next 0 hour(s) and 0 minute(s): Wiki creation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T2100) [22:59:54] !log dereckson@tin Synchronized multiversion/MWMultiVersion.php: Add wb.wikimedia.org to wikimedia.org domains to serve as wikis (T162510) (duration: 00m 40s) [23:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170413T2300). Please do the needful. [23:00:04] Smalyshev, Jdlrobson, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:08] !log Create Translate extension tables for wb.wikimedia (T162510) [23:00:13] here [23:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:14] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configurationfor wb.wikimedia.org (T162510) (duration: 00m 40s) [23:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:59] !log Create local-multiwrite stores for wb.wikimedia (T162510) [23:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:29] Niharika: I prepare the interwiki map for our two new wikis, and you've the tiller :) [23:02:36] SMalyshev: just incase you missed the memo wiki creation is running a bit over so swat should start soon [23:02:46] :) [23:02:47] sure, thanks [23:03:36] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3006458 keys, up 21 days 6 hours - replication_delay is 646 [23:06:24] (03PS1) 10Dereckson: Interwiki map 
update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348170 (https://phabricator.wikimedia.org/T149522) [23:06:41] (03CR) 10Dereckson: [C: 032] Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348170 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [23:07:10] RoanKattouw and Niharika > you'll find Zuul is a little slow today [23:07:25] Little is an understatement. [23:07:29] Yup :/ I just complained about that at length in #wikimedia-releng [23:07:38] One of my patches in this SWAT took half an hour to get its unit tests run [23:07:40] Niharika: RoanKattouw operations/puppets are given priority [23:08:00] (03Merged) 10jenkins-bot: Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348170 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [23:08:20] There's no puppet activity, so that's irrelevant [23:08:45] mediawiki-config has also be given top pri too [23:09:05] !log dereckson@tin Synchronized wmf-config/interwiki.php: DMOZ, pa.wikisource and wb.wikimedia interwiki map update (duration: 00m 41s) [23:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:15] at least gate and submit is currently empty [23:09:22] Okay I'm done :) [23:09:28] Yay. [23:09:29] Niharika: up to you [23:09:33] welcome to SWAT :) [23:09:39] yay [23:09:42] (03CR) 10jenkins-bot: Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348170 (https://phabricator.wikimedia.org/T149522) (owner: 10Dereckson) [23:09:44] (03CR) 10Niharika29: [C: 032] Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:11:19] Urbanecm: you know who wants to be wb.wikimedia bureaucrat? 
[23:11:41] (03PS2) 10Niharika29: Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) [23:11:52] (03CR) 10Niharika29: [V: 032 C: 032] Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) [23:12:11] (03CR) 10Niharika29: Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) [23:12:17] "Also, please add Me ( Wikimedia global user "jayantanth" and "Bodhisattwa") as sysops&bureaucrats." [23:12:19] (03CR) 10Niharika29: [C: 032] Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) [23:13:00] I'll need it (348115) on terbium to test it though (since I need it for command-line script) [23:13:12] (03PS4) 10Niharika29: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:14:13] (03PS1) 10Dzahn: contint/icinga: make jenkins service monitoring configurable [puppet] - 10https://gerrit.wikimedia.org/r/348171 (https://phabricator.wikimedia.org/T162822) [23:14:36] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2988674 keys, up 21 days 6 hours - replication_delay is 0 [23:14:40] Niharika: this step is simple, you can `ssh terbium scap pull` [23:14:50] (like you would do with mwdebug1002) [23:17:09] (03Merged) 10jenkins-bot: Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) [23:17:22] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3181771 (10Dzahn) [23:17:23] (03CR) 10jenkins-bot: Remove deprecated config option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347521 (owner: 10Jdlrobson) 
[23:17:44] jdlrobson: Duplicate patches? [23:18:16] jdlrobson: Specifically: [23:18:20] [config] 346453 Remove use of blacklist for related pages feature [23:18:20] [config] 346453 Add related pages to desktop ht wiki [23:18:28] One of those is probably not meant to be 346453? [23:18:31] oh whoops [23:18:35] let me grab correct url [23:19:02] https://gerrit.wikimedia.org/r/348161 [23:19:06] !log Create account for Jayantanth on wb.wikimedia (bureaucrat) [23:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:13] ^ RoanKattouw Niharika [23:19:17] jdlrobson: Could you fix the wiki page? [23:19:26] RoanKattouw: on it [23:20:35] jdlrobson: The first one of your patches is on mwdebug1002. [23:21:36] [config] 346453 Remove use of blacklist for related pages feature ? [23:21:57] No, [config] 347521 Remove deprecated config option [23:22:16] oh ok. there's nothing to test there [23:22:19] that's just cleanup [23:22:30] OK [23:22:52] SMalyshev: Your stuff is on terbium. [23:22:54] 06Operations, 10DBA, 10Icinga, 10Monitoring: tendril cert expiry alerts on dbmonitor hosts - https://phabricator.wikimedia.org/T162183#3181800 (10Dzahn) [23:23:02] Niharika: thanks, checking [23:23:39] Niharika: ok, everything is fine [23:26:17] SMalyshev: Syncing everywhere... 
[23:26:25] Niharika: thanks [23:26:59] !log niharika29@tin Synchronized php-1.29.0-wmf.20/extensions/CirrusSearch/: Revert Workaround OOM issue on ngrams field (duration: 00m 54s) [23:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:05] (03CR) 10Niharika29: [C: 032] Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:29:16] (03CR) 10jerkins-bot: [V: 04-1] Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:29:50] 06Operations, 10ops-codfw: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3181822 (10RobH) [23:29:56] (03PS2) 10Niharika29: Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:30:34] (03PS5) 10Niharika29: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:30:47] (03CR) 10Niharika29: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:30:50] (03CR) 10jerkins-bot: [V: 04-1] Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:30:53] (03CR) 10Niharika29: [C: 032] Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:32:01] (03Merged) 10jenkins-bot: Remove use of blacklist for related pages feature [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:32:16] (03CR) 10jenkins-bot: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346453 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:32:32] (03CR) 10Niharika29: Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:32:36] (03PS3) 10Niharika29: Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:32:40] (03CR) 10Niharika29: [C: 032] Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:32:46] (03PS1) 10Dzahn: tendril: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348172 (https://phabricator.wikimedia.org/T162183) [23:33:03] was there a SSH fingerprint change for bast4001? [23:33:05] (03PS2) 10Dzahn: contint/icinga: make jenkins service monitoring configurable [puppet] - 10https://gerrit.wikimedia.org/r/348171 (https://phabricator.wikimedia.org/T162822) [23:33:18] i'm getting a warning + "The fingerprint for the ECDSA key sent by the remote host is [23:33:18] 1a:fc:99:62:73:6d:ef:e0:7d:60:7d:bd:19:00:fc:1f." [23:33:39] HaeB: that's not bast4001.. uhm [23:33:43] jdlrobson: Your "Remove use of blacklist for related pages feature" patch is on mwdebug1002 if you wanna test it. 
[23:33:46] (03Merged) 10jenkins-bot: Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:34:02] ...not the one at https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast4001.wikimedia.org [23:34:04] (03PS1) 10Kaldari: Use new pageassessments dblist to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/348173 (https://phabricator.wikimedia.org/T159438) [23:34:15] (03CR) 10jenkins-bot: Enable related pages on Vector for htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348161 (https://phabricator.wikimedia.org/T126826) (owner: 10Jdlrobson) [23:34:26] jdlrobson: And your last one as well. [23:34:39] sweettt on it! [23:35:01] ECDSA key fingerprint is SHA256:UUsRMiUK9CkPg8yMiEPAKjs1PEhKxQPT+xhi4xRnjks. [23:35:07] For me, it works [23:35:12] HaeB: the one on wiki is still correct [23:35:28] | ECDSA | SHA-256 | UUsRMiUK9CkPg8yMiEPAKjs1PEhKxQPT+xhi4xRnjks= | [23:35:31] (works here = with a fingerprint coherent with the wiki one) [23:35:36] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 637 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2989231 keys, up 21 days 7 hours - replication_delay is 637 [23:36:08] mutante: and my known_hosts file hasn't changed in the last two weeks or so [23:36:18] kaldari: You wanna test AutoblockList on mwdebug1002? [23:36:23] HaeB: are you connecting to something behind bast4001 or really bast4001 itself [23:36:27] yes [23:36:28] HaeB: when you ping, you have 198.35.26.5? [23:36:42] to stat1004 [23:36:46] Niharika: all is good! Please sync! [23:36:50] jdlrobson: Cool! 
[23:37:07] Dereckson: yes [23:37:28] ...for ping bast4001.wikimedia.org [23:37:43] HaeB: what you are seeing is the fingerprint of stat1004 [23:37:49] and that is the host that changed [23:38:30] Niharika: Looks good except that the i18n messages are broken, but that's normal until the localization caches are rebuilt by the scap script. [23:38:31] hmm https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/stat1004.eqiad.wmnet&action=history [23:38:45] 13:54 ottomata: reimaging stat1004 as jessie [23:38:58] 2 days ago [23:39:16] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable related pages on Vector for htwiki (T126826) (duration: 00m 41s) [23:39:17] mutante: cool, thanks for clearing that up! [23:39:17] kaldari: Okay. [23:39:18] Niharika: In other words, you'll need to do a full scap for this deployment rather than just a sync-file or sync-dir [23:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:23] T126826: Remove Related Pages from desktop web beta (most wikis) - https://phabricator.wikimedia.org/T126826 [23:39:25] Right. [23:39:42] (03PS1) 10Dereckson: Fix Abuse Filter configuration for tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348174 (https://phabricator.wikimedia.org/T161960) [23:39:59] HaeB: welcome, so yea, that looks correct, though that wiki page should have been updated [23:40:22] mutante: should we file a phab task for that? (it looks like ottomata isn't online right now) [23:41:15] (I looked at https://wikitech.wikimedia.org/wiki/Special:RecentChanges before asking here, so would likely have seen the stat1004 change..) [23:41:17] HaeB: i was about to say "let's just comment on the re-install task" but .. i don't see an obvious one ... 
hrmm [23:41:49] yea, agree [23:41:54] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Remove use of blacklist for related pages feature (T162201) (duration: 00m 40s) [23:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:02] T162201: Cleanup artifacts of related pages desktop beta feature - https://phabricator.wikimedia.org/T162201 [23:42:06] it was only in SAL and you dont see that in RC [23:42:22] so you'll have to search for the host name in https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:25] to see it [23:42:36] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2988087 keys, up 21 days 7 hours - replication_delay is 7 [23:43:10] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: Remove use of blacklist for related pages feature (T162201) (duration: 00m 40s) [23:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:30] mutante: i probably wouldn't have known that reimaging changes the fingerprint anyway ;) [23:44:01] mutante: in any case, updating https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1004.eqiad.wmnet should be a no-brainer imho [23:44:06] yea, there should be a ticket for that re-imaging and probably a mail to a list [23:44:27] if it has shell users like that [23:44:35] ok i can file a task just for the fingerprint - i'm sure andrew can merge if necessary [23:44:48] ok [23:45:42] (03PS1) 10Niharika29: Revert "Remove use of blacklist for related pages feature" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348175 [23:46:02] jdlrobson: For https://gerrit.wikimedia.org/r/#/c/348175/ you didn't change it in CommonSettings.php.. [23:46:05] Reverting for now. 
[23:46:27] jdlrobson: It was spewing giant amounts of log spam like: [23:46:28] Apr 13 23:44:16 mw1200: #012Notice: Undefined variable: wmgRelatedArticlesFooterBlacklistedSkins in /srv/mediawiki/wmf-config/CommonSettings.php on line 2878 [23:46:40] (03PS2) 10Niharika29: Revert "Remove use of blacklist for related pages feature" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348175 [23:46:45] ah damn [23:46:47] (03CR) 10Niharika29: [C: 032] Revert "Remove use of blacklist for related pages feature" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348175 (owner: 10Niharika29) [23:46:48] can i fix that up now? [23:46:50] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6155/" [puppet] - 10https://gerrit.wikimedia.org/r/348171 (https://phabricator.wikimedia.org/T162822) (owner: 10Dzahn) [23:46:55] is there time? [23:46:57] sorry.. [23:48:01] jdlrobson: Yeah, sure. [23:48:06] i hate the wmg prefix and its misuse...!! :) [23:48:48] (03PS1) 10Jdlrobson: Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348176 (https://phabricator.wikimedia.org/T162201) [23:48:58] ^ Niharika the new patch is there. Sorry about that... :/ [23:48:58] (03Merged) 10jenkins-bot: Revert "Remove use of blacklist for related pages feature" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348175 (owner: 10Niharika29) [23:49:05] 06Operations, 10ops-codfw: Swap NIC on mira - https://phabricator.wikimedia.org/T162859#3181921 (10faidon) [23:49:07] (03CR) 10jenkins-bot: Revert "Remove use of blacklist for related pages feature" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348175 (owner: 10Niharika29) [23:49:07] 06Operations, 10hardware-requests: spare pool allocation of WMF6406 to replace mira - https://phabricator.wikimedia.org/T162897#3181919 (10faidon) 05Open>03Resolved Yup, that's fine. [23:49:14] "will be moderated"? 
[23:50:17] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3181938 (10faidon) a:05faidon>03RobH That allocation sounds temporary, right? Sounds fine. [23:51:01] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Revert Remove use of blacklist for related pages feature (T162201) (duration: 00m 41s) [23:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:08] T162201: Cleanup artifacts of related pages desktop beta feature - https://phabricator.wikimedia.org/T162201 [23:52:38] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 607 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2988087 keys, up 21 days 7 hours - replication_delay is 607 [23:55:39] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team, 13Patch-For-Review: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3181965 (10Dzahn) "jenkins_service_running" is now gone on contint2001 (and fine on contint1001... [23:56:02] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Retry sync Revert Remove use of blacklist for related pages feature (T162201) (duration: 00m 40s) [23:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:09] T162201: Cleanup artifacts of related pages desktop beta feature - https://phabricator.wikimedia.org/T162201 [23:56:09] PROBLEM - dhclient process on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:29] PROBLEM - nutcracker port on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:56:39] PROBLEM - nutcracker process on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:56:49] PROBLEM - puppet last run on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:03] ^ that is a brandnew host. i dont know about it, but i _just_ saw it being added to Icinga config [23:57:09] PROBLEM - Check size of conntrack table on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:10] PROBLEM - salt-minion processes on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:17] 06Operations, 10ops-eqiad, 10Phabricator, 06Release-Engineering-Team, 10hardware-requests: replacement hardware for iridium (phabricator) - https://phabricator.wikimedia.org/T156970#3181972 (10RobH) a:05RobH>03faidon >>! In T156970#3181938, @faidon wrote: > That allocation sounds temporary, right? So... [23:57:20] PROBLEM - Check systemd state on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:22] (03CR) 10Niharika29: [C: 032] Remove use of blacklist for related pages feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348176 (https://phabricator.wikimedia.org/T162201) (owner: 10Jdlrobson) [23:57:30] PROBLEM - Check the NTP synchronisation status of timesyncd on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:49] PROBLEM - Check whether ferm is active by checking the default input chain on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:57:59] PROBLEM - DPKG on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:58:01] ah, mira replacement [23:58:09] PROBLEM - Disk space on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:58:39] PROBLEM - Keyholder SSH agent on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:58:59] PROBLEM - MD RAID on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:59:04] ACKNOWLEDGEMENT - Check size of conntrack table on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:04] ACKNOWLEDGEMENT - Check systemd state on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:04] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:04] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:04] ACKNOWLEDGEMENT - DPKG on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:04] ACKNOWLEDGEMENT - Disk space on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:04] ACKNOWLEDGEMENT - Improperly owned -0:0- files in /srv/mediawiki-staging on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:05] ACKNOWLEDGEMENT - Keyholder SSH agent on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:05] ACKNOWLEDGEMENT - MD RAID on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:06] ACKNOWLEDGEMENT - Unmerged changes on repository mediawiki_config on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:06] ACKNOWLEDGEMENT - configured eth on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:07] ACKNOWLEDGEMENT - dhclient process on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:07] ACKNOWLEDGEMENT - nutcracker port on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:08] ACKNOWLEDGEMENT - nutcracker process on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:08] ACKNOWLEDGEMENT - puppet last run on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:09] ACKNOWLEDGEMENT - salt-minion processes on naos is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T162900 [23:59:31] jdlrobson: Can you update the Deployments page for realsies now? :)