[00:01:31] (PS1) Dzahn: netmon1001: add missing reverse IPv6 record [dns] - https://gerrit.wikimedia.org/r/351222
[00:03:23] (PS2) Dzahn: netmon1001: add missing reverse IPv6 record [dns] - https://gerrit.wikimedia.org/r/351222
[00:03:56] (PS2) Dzahn: add IPv6 for netmon1002, forward and reverse records [dns] - https://gerrit.wikimedia.org/r/351221 (https://phabricator.wikimedia.org/T159756)
[00:06:26] (PS1) Tim Starling: Disable suppress_san_warnings [puppet] - https://gerrit.wikimedia.org/r/351223
[00:09:03] (CR) jerkins-bot: [V: -1] Disable suppress_san_warnings [puppet] - https://gerrit.wikimedia.org/r/351223 (owner: Tim Starling)
[00:09:13] (PS3) Dzahn: add IPv6 for netmon1002, forward and reverse records [dns] - https://gerrit.wikimedia.org/r/351221 (https://phabricator.wikimedia.org/T159756)
[00:11:14] (CR) Dzahn: [C: 2] add IPv6 for netmon1002, forward and reverse records [dns] - https://gerrit.wikimedia.org/r/351221 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[00:11:57] TimStarling: it's just that it wants the "=>
[00:12:00] to be aligned
[00:12:18] nitpicking CI
[00:12:26] yea, it's puppet-lint
[00:12:54] this is why I never use arrow alignment, it generates diffs that are harder to read, changes unrelated lines
[00:13:38] (PS2) Tim Starling: Disable suppress_san_warnings [puppet] - https://gerrit.wikimedia.org/r/351223
[00:13:53] yea, i see that point about changing unrelated lines
[00:14:15] we could ignore that one check, i just fixed them all at one point in the past
[00:16:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[00:16:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[00:16:58] (CR) Dzahn: ";; ANSWER SECTION:" [dns] - https://gerrit.wikimedia.org/r/351221 (https://phabricator.wikimedia.org/T159756) (owner:
Dzahn)
[00:17:40] (CR) Tim Starling: [C: 2] Disable suppress_san_warnings [puppet] - https://gerrit.wikimedia.org/r/351223 (owner: Tim Starling)
[00:18:57] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3226683 (Dzahn)
[00:19:26] ignore the check *and* reformat the entire repo to have unaligned arrows, otherwise people will keep aligning the arrows by hand
[00:19:31] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3077541 (Dzahn) installed OS, added to puppet, accepted salt key, setup IPv6: ``` ;; ANSWER SECTION: netmon1002.wikimedia.org. 3600 IN AAAA 2620:0:861:1:208:80:154:5 -- host 2620:0:861:1:208:80:15...
[00:22:00] ACKNOWLEDGEMENT - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (428815s 200000s) andrew bogott I think this is a false alert but I will investigate.
[00:22:38] hmm.. that would be reversing everything we did but i see what you mean
[00:24:19] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:24:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:24:21] it's your repo, up to you, just saying this is why there are very few aligned arrows in the projects I work on
[00:25:14] !log populating production etcd with initial mediawiki config keys
[00:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:27] (PS1) Dzahn: puppet-lint: ignore arrow alignment [puppet] - https://gerrit.wikimedia.org/r/351225
[00:28:41] Somehow I'm currently having to wait sometime between 30 seconds and one minute to have Special:RecentChanges or Special:Contributions being loaded at dewiki.
Normal (rendered, not Special:) pages work fine on dewiki, and enwiki special pages work fine too (haven't tested other things).
[00:28:49] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153)
[00:29:09] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:29:20] ^ the "oob" part is out-of-band
[00:33:17] !log tstarling@puppetmaster1001 conftool action : set/@dc-codfw.yaml; selector: name=WMFMasterDatacenter
[00:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:46] !log tstarling@puppetmaster1001 conftool action : set/@read-write.yaml; selector: name=ReadOnly
[00:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:59] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 57.53 ms
[00:34:19] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 34.91 ms
[00:35:00] (PS2) Dzahn: puppet-lint: ignore arrow alignment [puppet] - https://gerrit.wikimedia.org/r/351225
[00:35:28] Anyway, it's 0230 here, so I'm off now. Just wanted to let you know in case anybody cares :)
[00:37:13] eddiegp: thanks for reporting. it seems fast for me on de.wp though
[00:41:33] (PS3) Krinkle: Update interwiki map (disable __list sorting) [mediawiki-config] - https://gerrit.wikimedia.org/r/350899 (https://phabricator.wikimedia.org/T145337)
[00:41:58] (PS3) Dzahn: remove barium.frack.eqiad, keep mgmt [dns] - https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952)
[00:42:36] (CR) Dzahn: "ok, amending. keep mgmt" [dns] - https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952) (owner: Dzahn)
[00:42:43] (PS4) Dzahn: remove barium.frack.eqiad, keep mgmt [dns] - https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952)
[00:42:59] (CR) Krinkle: "Meh, will need to do next week."
[mediawiki-config] - https://gerrit.wikimedia.org/r/350899 (https://phabricator.wikimedia.org/T145337) (owner: Krinkle)
[00:43:49] (CR) Dzahn: [C: 2] remove barium.frack.eqiad, keep mgmt [dns] - https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952) (owner: Dzahn)
[00:46:04] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops, Patch-For-Review: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3226716 (Dzahn)
[00:47:29] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops, Patch-For-Review: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3181286 (Dzahn) I checked all the boxes in the description that i could check. I can confirm it's out of puppet repo, the production IP is gone. th...
[00:47:53] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3226720 (Dzahn)
[00:48:16] (PS1) Tim Starling: Add $wmfMasterDatacenter to meta=siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924)
[00:49:01] (CR) Dzahn: [C: 1] "@DBA's - friendly ping if you have a moment for that, can wait until after dcswitch though if you are busy" [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn)
[00:51:22] Operations, netops: netmon1002 networking setup - https://phabricator.wikimedia.org/T159757#3226739 (Dzahn) This is unblocked now since netmon1002 has been installed, and IPv6 has been configured (T159756#3226683) netmon1002.wikimedia.org has address 208.80.154.5 netmon1002.wikimedia.org has IPv6 addres...
[00:52:25] (CR) Krinkle: Add $wmfMasterDatacenter to meta=siteinfo (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[00:53:44] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3226746 (Dzahn)
[00:57:11] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops, netops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3226748 (Dzahn)
[00:59:34] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops, netops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3226767 (Dzahn) adding #netops for the "switch port" check boxes. needs access to srx550s.
[00:59:34] (PS2) Tim Starling: Add $wmfMasterDatacenter to meta=siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924)
[01:13:53] Operations: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238#3226775 (Dzahn)
[01:14:27] Operations: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238#3226787 (Dzahn)
[01:14:47] Operations, Icinga: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238#3226775 (Dzahn)
[01:31:29] RECOVERY - configured eth on elastic2020 is OK: OK - interfaces up
[01:31:39] RECOVERY - MD RAID on elastic2020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[01:31:39] RECOVERY - DPKG on elastic2020 is OK: All packages OK
[01:31:49] RECOVERY - dhclient process on elastic2020 is OK: PROCS OK: 0 processes with command name dhclient
[01:31:59] RECOVERY - Disk space on elastic2020 is OK: DISK OK
[01:31:59] RECOVERY - Check size of conntrack table on elastic2020 is OK: OK: nf_conntrack is 0 % full
[01:32:00] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2020 is OK: OK ferm input
default policy is set
[01:32:19] RECOVERY - salt-minion processes on elastic2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:34:39] RECOVERY - Check the NTP synchronisation status of timesyncd on elastic2020 is OK: OK: synced at Tue 2017-05-02 01:34:36 UTC.
[01:51:59] RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[01:57:09] (PS3) Krinkle: Add $wmfMasterDatacenter to meta=siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[01:57:14] (CR) Krinkle: Add $wmfMasterDatacenter to meta=siteinfo (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[01:57:17] (CR) Krinkle: [C: 1] Add $wmfMasterDatacenter to meta=siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[02:28:41] (CR) Tim Starling: [C: 2] Add $wmfMasterDatacenter to meta=siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[02:29:52] (Merged) jenkins-bot: Add $wmfMasterDatacenter to meta=siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[02:30:06] (CR) jenkins-bot: Add $wmfMasterDatacenter to meta=siteinfo [mediawiki-config] - https://gerrit.wikimedia.org/r/351232 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[02:34:36] !log tstarling@naos Synchronized wmf-config/CommonSettings.php: siteinfo hook (duration: 02m 39s)
[02:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:39] Operations, MediaWiki-Configuration, MediaWiki-Platform-Team, Performance-Team, and 10 others: Allow
integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3226852 (tstarling) >>! In T156924#3226349, @tstarling wrote: >>>! In T156924#3224786, @V...
[02:45:24] !log tstarling@naos Synchronized php-1.29.0-wmf.21/includes/config/EtcdConfig.php: EtcdConfig backported bug fixes (duration: 01m 02s)
[02:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:40] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv/deployment/ocg/output 10330 MB (3% inode=98%)
[03:10:31] !log mattflaschen@naos Synchronized php-1.29.0-wmf.21/extensions/FlaggedRevs/: Urgent deploy: Fix FlaggedRevs fatal, and also a filter issue: T164096 and T164049 (duration: 00m 56s)
[03:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:41] T164096: 503 "Service Unavailable" error on Special:RevisionReview - https://phabricator.wikimedia.org/T164096
[03:10:42] T164049: [Regression] Sight links in RecentChanges not longer available (when option "Hide reviewed edits" is not checked) - https://phabricator.wikimedia.org/T164049
[03:11:26] ^ greg-g
[03:15:58] Operations, ops-eqiad, DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3226868 (Ottomata) As long as we give a couple of days heads up, I think we're fine. Pick a day, any day :) Just let me know with enough time to get an email out.
[04:12:29] Operations, Performance-Team: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3225543 (Peter) Let me look into what JS has triggered the alerts (and probably change the alert setting). The alerts are triggered if we have a 10% increase in size of Javascrip...
[04:17:38] Operations, Performance-Team: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3226952 (Peter) Increased the limits for now for mobile and will follow up later today.
[04:36:11] (PS1) Giuseppe Lavagetto: Fix etcd SRV records in codfw [dns] - https://gerrit.wikimedia.org/r/351239
[04:36:40] (CR) Giuseppe Lavagetto: [C: 2] Fix etcd SRV records in codfw [dns] - https://gerrit.wikimedia.org/r/351239 (owner: Giuseppe Lavagetto)
[04:43:51] (PS1) Giuseppe Lavagetto: etcd: create ecdsa cert for eqiad [puppet] - https://gerrit.wikimedia.org/r/351240
[06:20:36] (CR) Tim Starling: "We're just waiting for Giuseppe to get the eqiad etcd cluster ready so that this can be deployed. Suggested deployment plan:" [mediawiki-config] - https://gerrit.wikimedia.org/r/351132 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[06:24:17] Operations, Performance-Team: HTTP responses from app servers sometimes stall for >1s - https://phabricator.wikimedia.org/T164248#3227053 (Krinkle)
[06:34:37] (CR) Phuedx: [C: -1] "All of the entries are missing the wiki suffix, right? Whoops! Let's fix them all at the same time."
[mediawiki-config] - https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: Jdlrobson)
[06:41:30] (PS1) Giuseppe Lavagetto: conf1*: convert eqiad cluster to use role::configcluster [puppet] - https://gerrit.wikimedia.org/r/351249
[06:44:44] (CR) Giuseppe Lavagetto: [C: 2] etcd: create ecdsa cert for eqiad [puppet] - https://gerrit.wikimedia.org/r/351240 (owner: Giuseppe Lavagetto)
[06:46:29] <_joe_> !log disabling etcd auth on conf1*, converting to use nginx for TLS/auth T159687
[06:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:38] T159687: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687
[06:46:48] thanks moritzm
[06:52:29] (PS1) Muehlenhoff: Remove access credetials for zareen [puppet] - https://gerrit.wikimedia.org/r/351251
[06:52:51] (PS2) Muehlenhoff: Remove access credentials for zareen [puppet] - https://gerrit.wikimedia.org/r/351251
[06:56:05] (CR) Muehlenhoff: [C: 2] Remove access credentials for zareen [puppet] - https://gerrit.wikimedia.org/r/351251 (owner: Muehlenhoff)
[07:03:55] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3227101 (ayounsi)
[07:04:24] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3181286 (ayounsi) Switch port cleaned up in T162950
[07:06:09] PROBLEM - puppet last run on notebook1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup]
[07:06:20] <_joe_> moritzm: ^^
[07:06:35] (PS2) Giuseppe Lavagetto: conf1*: convert eqiad cluster to use role::configcluster [puppet] - https://gerrit.wikimedia.org/r/351249
[07:09:10] (PS1) Giuseppe Lavagetto: Add snakeoil private cert for etcd [labs/private] - https://gerrit.wikimedia.org/r/351252
[07:09:29] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add snakeoil private cert for etcd [labs/private] - https://gerrit.wikimedia.org/r/351252 (owner: Giuseppe Lavagetto)
[07:13:07] having a look
[07:13:32] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/6265/conf1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/351249 (owner: Giuseppe Lavagetto)
[07:15:09] RECOVERY - puppet last run on notebook1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:27:29] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:32:29] (PS2) Elukey: Swap mc1001->mc1012 with mc1019->mc2030 [puppet] - https://gerrit.wikimedia.org/r/350549 (https://phabricator.wikimedia.org/T137345)
[07:36:44] <_joe_> !log starting etcd replication codfw => eqiad
[07:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:29] (CR) Volans: [C: 1] "LGTM, minor (also optional) comments inline" (3 comments) [switchdc] - https://gerrit.wikimedia.org/r/351003 (owner: Giuseppe Lavagetto)
[07:40:12] (CR) Muehlenhoff: [C: 2] Make mediawiki-firejail-ghostscript quiet [puppet] - https://gerrit.wikimedia.org/r/351123 (https://phabricator.wikimedia.org/T164145) (owner: Gergő Tisza)
[07:40:22] (PS2) Muehlenhoff: Make mediawiki-firejail-ghostscript quiet [puppet] - https://gerrit.wikimedia.org/r/351123 (https://phabricator.wikimedia.org/T164145) (owner: Gergő Tisza)
[07:44:08] (CR) Hashar: "My previous attempt made the change directly to the .profile file https://gerrit.wikimedia.org/r/#/c/338980/1/modules/mediawiki/files/medi" [puppet] - https://gerrit.wikimedia.org/r/351123 (https://phabricator.wikimedia.org/T164145) (owner: Gergő Tisza)
[07:45:50] RECOVERY - Check systemd state on conf1002 is OK: OK - running: The system is fully operational
[07:46:04] (Restored) Hashar: mediawiki-firejail: explicitly signal end of options [puppet] - https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: Hashar)
[07:46:14] Operations, Patch-For-Review, Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#1000364 (ayounsi) Note that currently: ```$ curl -v https://torrus.wikimedia.org/``` has a 301 to "Location: https:///"
[07:49:19] (PS2) Hashar: mediawiki-firejail: explicitly signal end of options [puppet] - https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649)
[07:49:48] (PS3) Hashar: mediawiki-firejail: explicitly signal end of options [puppet] -
https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649)
[07:51:18] Operations, Wikimedia-General-or-Unknown, Patch-For-Review: Investigate why firejails break PdfHandler - https://phabricator.wikimedia.org/T164145#3223332 (hashar) >>! In T164145#3224445, @Tgr wrote: > ... > So it seems firejail is messing up parameter parsing somehow which results in gs not finding...
[07:51:20] PROBLEM - Etcd replication lag on conf1002 is CRITICAL: connect to address 10.64.32.180 and port 8000: Connection refused
[07:51:50] PROBLEM - etcdmirror-conftool-codfw-wmnet service on conf1002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-codfw-wmnet is failed
[07:51:50] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:53:42] (PS1) Elukey: Replace mc100[123] with mc10(19|2[01]) after hw refresh [mediawiki-config] - https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345)
[07:53:51] (CR) Volans: [C: 1] "LGTM, minor comments inline" (4 comments) [switchdc] - https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) (owner: Giuseppe Lavagetto)
[07:58:06] Operations, Patch-For-Review, User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3227160 (Joe) I converted the etcd cluster in eqiad to use nginx for auth/TLS, moved to ecdsa certs with the correct SANs, and started replication codfw => eqiad. I might start to make cl...
[07:58:47] !log wap mc1001->mc1012 with mc1019->mc2030
[07:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:20] RECOVERY - Etcd replication lag on conf1002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time
[07:59:22] !log Swap mc1001->mc1012 with mc1019->mc2030 - T137345 (more informative :)
[07:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:28] T137345: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345
[07:59:34] (CR) Elukey: [C: 2] Swap mc1001->mc1012 with mc1019->mc2030 [puppet] - https://gerrit.wikimedia.org/r/350549 (https://phabricator.wikimedia.org/T137345) (owner: Elukey)
[07:59:41] (PS3) Elukey: Swap mc1001->mc1012 with mc1019->mc2030 [puppet] - https://gerrit.wikimedia.org/r/350549 (https://phabricator.wikimedia.org/T137345)
[07:59:50] RECOVERY - etcdmirror-conftool-codfw-wmnet service on conf1002 is OK: OK - etcdmirror-conftool-codfw-wmnet is active
[07:59:50] RECOVERY - Check systemd state on conf1002 is OK: OK - running: The system is fully operational
[08:04:07] !log Upgrading Jenkins to 2.7.4 - T144106
[08:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:16] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106
[08:04:37] !log Installing Jenkins plugin Pipeline: Stage View https://plugins.jenkins.io/pipeline-stage-view
[08:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:20] PROBLEM - Etcd replication lag on conf1002 is CRITICAL: connect to address 10.64.32.180 and port 8000: Connection refused
[08:06:20] RECOVERY - Etcd replication lag on conf1002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.075 second response time
[08:07:34] <_joe_> this ^^ will fail again
[08:07:54] <_joe_> something wrong with etcdmirror on conf1002, no idea why
[08:08:16] (CR) Alexandros Kosiaris: "FWIW, I am kind of indifferent on this one these days. vim does this automatically for me, git hooks forbid me from pushing non aligned ar" [puppet] - https://gerrit.wikimedia.org/r/351225 (owner: Dzahn)
[08:08:50] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[redis-instance-tcp_6379]
[08:09:28] first puppet run fails for some reason, second one succeeds
[08:10:51] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[08:10:55] (CR) Jcrespo: "Not now- aside from merging, this has to deploy, and we shouldn't be distracted with misc services right now." [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn)
[08:11:50] PROBLEM - etcdmirror-conftool-codfw-wmnet service on conf1002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-codfw-wmnet is failed
[08:11:59] <_joe_> elukey: it's a race condition between redis and the confd-managed files
[08:12:00] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:12:02] <_joe_> we can fix it
[08:12:11] yeah
[08:12:20] PROBLEM - Etcd replication lag on conf1002 is CRITICAL: connect to address 10.64.32.180 and port 8000: Connection refused
[08:14:02] (CR) Alexandros Kosiaris: [C: 2] netmon1001: add missing reverse IPv6 record [dns] - https://gerrit.wikimedia.org/r/351222 (owner: Dzahn)
[08:14:06] (PS3) Alexandros Kosiaris: netmon1001: add missing reverse IPv6 record [dns] - https://gerrit.wikimedia.org/r/351222 (owner: Dzahn)
[08:14:09] (CR) Alexandros Kosiaris: [V: 2 C: 2] netmon1001: add missing reverse IPv6 record [dns] - https://gerrit.wikimedia.org/r/351222 (owner: Dzahn)
[08:14:35] Operations, fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3227173 (fgiunchedi) @Jgreen any news/updates on having FR fully on jessie?
[08:14:42] !log Installing Jenkins Pipeline plugin
[08:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:51] all right mc1019->mc1036 are good, running puppet in the codfw ones
[08:22:03] Operations, netops: netmon1002 networking setup - https://phabricator.wikimedia.org/T159757#3227188 (ayounsi) Open>Resolved a:ayounsi I don't see the IP or hostname of netmon1001 hardcoded in network devices. DNS records for librenms.wikimedia.org will have to point to the new IP as it's the...
[08:22:05] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3227194 (ayounsi)
[08:22:20] PROBLEM - IPsec on mc1020 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2020_v4
[08:23:20] RECOVERY - IPsec on mc1020 is OK: Strongswan OK - 1 ESP OK
[08:23:43] thanks
[08:26:07] !log Upgrading Jenkins to 2.19.4 - T144106
[08:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:16] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106
[08:28:00] PROBLEM - IPsec on mc1022 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2022_v4
[08:30:00] RECOVERY - IPsec on mc1022 is OK: Strongswan OK - 1 ESP OK
[08:30:00] PROBLEM - IPsec on mc1030 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2030_v4
[08:30:17] Operations, Monitoring, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Evaluate prometheus snmp_exporter for Torrus PDUs metrics use case - https://phabricator.wikimedia.org/T148541#3227205 (fgiunchedi) a:fgiunchedi
[08:31:00] RECOVERY - IPsec on mc1030 is OK: Strongswan OK - 1 ESP OK
[08:31:11] Operations, Monitoring, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541#2725758 (fgiunchedi)
[08:32:36] !log stop and mask redis on mc1001-mc1018 - T137345
[08:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:45] T137345: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345
[08:33:28] !log Upgrading Jenkins to 2.32.3 - T144106
[08:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:36] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106
[08:36:00] RECOVERY - Check systemd state on conf1002 is OK: OK - running: The system is
fully operational
[08:36:15] (PS1) Giuseppe Lavagetto: profile::etcd::tlsproxy: turn off proxy buffering [puppet] - https://gerrit.wikimedia.org/r/351257 (https://phabricator.wikimedia.org/T159687)
[08:36:20] RECOVERY - Etcd replication lag on conf1002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.077 second response time
[08:36:21] RECOVERY - etcdmirror-conftool-codfw-wmnet service on conf1002 is OK: OK - etcdmirror-conftool-codfw-wmnet is active
[08:40:14] !log run puppet and restart nutcracker on eqiad hosts with profile::mediawiki::nutcracker
[08:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:29] !log Upgrading Jenkins to 2.46.2 - T144106
[08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:38] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106
[08:42:00] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:42:21] PROBLEM - Etcd replication lag on conf1002 is CRITICAL: connect to address 10.64.32.180 and port 8000: Connection refused
[08:42:21] PROBLEM - etcdmirror-conftool-codfw-wmnet service on conf1002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-codfw-wmnet is failed
[08:43:31] (CR) Alexandros Kosiaris: [C: 1] profile::etcd::tlsproxy: turn off proxy buffering [puppet] - https://gerrit.wikimedia.org/r/351257 (https://phabricator.wikimedia.org/T159687) (owner: Giuseppe Lavagetto)
[08:44:00] Blocked-on-Operations, Operations, Graphite, WMDE-Analytics-Engineering, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#3227236 (fgiunchedi) @Dzahn indeed, and there's ~500G left on the vg still. I'll debug this with @Eevans but I suspect it is relat...
[08:45:04] Operations, netops: netmon1002 networking setup - https://phabricator.wikimedia.org/T159757#3227240 (akosiaris)
[08:47:01] (CR) Filippo Giunchedi: [C: 1] Reading Web Page Previews alerts [puppet] - https://gerrit.wikimedia.org/r/350377 (owner: Phuedx)
[08:47:06] (CR) Giuseppe Lavagetto: [C: 2] profile::etcd::tlsproxy: turn off proxy buffering [puppet] - https://gerrit.wikimedia.org/r/351257 (https://phabricator.wikimedia.org/T159687) (owner: Giuseppe Lavagetto)
[08:48:16] Operations, Ops-Access-Requests, Patch-For-Review: Request to add phuedx to "researchers" group - https://phabricator.wikimedia.org/T164060#3227242 (phuedx) Thanks y'all. I can confirm that `ssh notebook1001.eqiad.wmnet` WFM 👍
[08:49:20] RECOVERY - Etcd replication lag on conf1002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time
[08:49:21] RECOVERY - etcdmirror-conftool-codfw-wmnet service on conf1002 is OK: OK - etcdmirror-conftool-codfw-wmnet is active
[08:50:00] RECOVERY - Check systemd state on conf1002 is OK: OK - running: The system is fully operational
[08:52:25] (PS1) Alexandros Kosiaris: Fix netmon1001's IPv6 address [dns] - https://gerrit.wikimedia.org/r/351258
[08:52:35] (CR) Alexandros Kosiaris: [V: 2 C: 2] Fix netmon1001's IPv6 address [dns] - https://gerrit.wikimedia.org/r/351258 (owner: Alexandros Kosiaris)
[08:53:11] (CR) Filippo Giunchedi: "> I'd consider piping the 5xx logs into Kafka from kafkatee, rather" [puppet] - https://gerrit.wikimedia.org/r/350817 (https://phabricator.wikimedia.org/T149451) (owner: Filippo Giunchedi)
[08:53:19] (CR) Gergő Tisza: "For utilities managing unknown command line parameters it is customary to take the first non-option parameter as the end of options, and f" [puppet] - https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: Hashar)
[08:54:16] (PS3) Filippo Giunchedi: grafana: break down HTTP 499 in swift
[puppet] - https://gerrit.wikimedia.org/r/350556
[08:54:59] (PS1) Alexandros Kosiaris: Remove ganglia aggregator from netmon1001 [puppet] - https://gerrit.wikimedia.org/r/351260
[08:56:16] (CR) Filippo Giunchedi: [V: 2 C: 2] grafana: break down HTTP 499 in swift [puppet] - https://gerrit.wikimedia.org/r/350556 (owner: Filippo Giunchedi)
[08:58:02] (PS1) Hashar: jenkins: remove groovy init that disabled CLI [puppet] - https://gerrit.wikimedia.org/r/351261
[08:58:35] (PS2) Elukey: nutcracker: listen on localhost for stats [puppet] - https://gerrit.wikimedia.org/r/324642 (https://phabricator.wikimedia.org/T111934) (owner: Filippo Giunchedi)
[09:01:05] (Abandoned) Hashar: Support Jenkins install from 'experimental' component [puppet] - https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: Hashar)
[09:01:11] I don't get why there's wikibugs___ but wikibugs too
[09:01:32] tool running twice on labs?
[09:02:05] Operations, Page-Previews, Performance-Team, Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3227302 (Gilles) We don't make that distinction. The time it takes for something to come up // is// performance. It's a...
[09:03:20] godog: but they don't log twice the same messages... 2 workers of a queue?
:-P [09:03:58] volans: yeah looks like it, no idea how the gerrit notification thing actually works [09:04:20] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Jenkins: Upload Jenkins LTS v2.46.2 to jessie-wikimedia/third-party - https://phabricator.wikimedia.org/T157429#3227306 (10hashar) [09:06:25] (03PS2) 10Hashar: jenkins: remove groovy init that disabled CLI [puppet] - 10https://gerrit.wikimedia.org/r/351261 [09:07:13] (03PS1) 10Alexandros Kosiaris: Remove torrus.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/351265 (https://phabricator.wikimedia.org/T87840) [09:07:45] wow, end of an era :-) [09:09:38] (03CR) 10Giuseppe Lavagetto: Add stage for restarting Redis. (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) (owner: 10Giuseppe Lavagetto) [09:09:45] (03CR) 10Giuseppe Lavagetto: Move most redis code to a library (033 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/351003 (owner: 10Giuseppe Lavagetto) [09:11:20] godog: ready to merge https://gerrit.wikimedia.org/r/#/c/324642 ? [09:11:35] (03PS2) 10Giuseppe Lavagetto: Move most redis code to a library [switchdc] - 10https://gerrit.wikimedia.org/r/351003 [09:11:37] (03PS4) 10Giuseppe Lavagetto: Add stage for restarting Redis. 
[switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) [09:13:28] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) (owner: 10Giuseppe Lavagetto) [09:13:30] elukey: not really, nutcracker will restart when /etc/default/nutcracker changes :( [09:14:09] ahhh sorry you are right, that one is not the config file [09:14:10] RECOVERY - Elasticsearch HTTPS on elastic2020 is OK: SSL OK - Certificate elastic2020.codfw.wmnet valid until 2022-05-01 09:13:09 +0000 (expires in 1824 days) [09:14:10] !log OpenStack / wmflabs fails to create new instances [09:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:38] (03PS3) 10Giuseppe Lavagetto: Move most redis code to a library [switchdc] - 10https://gerrit.wikimedia.org/r/351003 [09:14:40] (03PS5) 10Giuseppe Lavagetto: Add stage for restarting Redis. [switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) [09:15:20] (03CR) 10Volans: [C: 031] "LGTM" [switchdc] - 10https://gerrit.wikimedia.org/r/351003 (owner: 10Giuseppe Lavagetto) [09:16:26] !log Stopping Nodepool [09:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:30] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [09:19:30] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:19:50] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave] [09:21:14] !log Starting Nodepool [09:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:30] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [09:21:30] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational [09:22:10] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [09:23:17] elastic2020? fully operational ? lol.. let's see for how long that lasts [09:24:02] !log remove configuration from ge-8/0/0, ge-8/0/3 from asw-b-codfw for ganeti2005, ganeti2006 move to row A. T164011 [09:24:08] (03PS1) 10Jcrespo: Depool db1015 (low on space) commenting other db ongoing issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351273 [09:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:10] T164011: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011 [09:25:00] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [09:27:00] !log create interface range ganeti on asw-a-codfw. T164011 [09:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:01] !log Set description for ganeti2005, ganeti2006 on asw-a-codfw. T164011 [09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:57] !log Nodepool can not add instances to Jenkins any more. 
Rolling back Jenkins to 2.32.3 [09:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] (03PS2) 10Jcrespo: Depool db1015 (low on space) & add db1097 & other db ongoing issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351273 [09:36:21] (03PS1) 10Alexandros Kosiaris: Assign new IPs to ganeti2005, ganeti2006 [dns] - 10https://gerrit.wikimedia.org/r/351275 (https://phabricator.wikimedia.org/T164011) [09:36:42] !log Jenkins/CI is back up! [09:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:32] (03CR) 10Jcrespo: [C: 032] Depool db1015 (low on space) & add db1097 & other db ongoing issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351273 (owner: 10Jcrespo) [09:39:37] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Move most redis code to a library [switchdc] - 10https://gerrit.wikimedia.org/r/351003 (owner: 10Giuseppe Lavagetto) [09:39:59] (03Merged) 10jenkins-bot: Depool db1015 (low on space) & add db1097 & other db ongoing issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351273 (owner: 10Jcrespo) [09:40:09] (03CR) 10jenkins-bot: Depool db1015 (low on space) & add db1097 & other db ongoing issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351273 (owner: 10Jcrespo) [09:41:01] (03PS1) 10Alexandros Kosiaris: Remove torrus role, module, varnish backend and references [puppet] - 10https://gerrit.wikimedia.org/r/351276 [09:43:07] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add stage for restarting Redis. 
[switchdc] - 10https://gerrit.wikimedia.org/r/351004 (https://phabricator.wikimedia.org/T163337) (owner: 10Giuseppe Lavagetto) [09:47:14] !log jynus@naos Synchronized wmf-config/db-eqiad.php: Depool db1015 & add db1097 (duration: 01m 17s) [09:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:38] !log jynus@naos Synchronized wmf-config/db-codfw.php: Add db1097 (duration: 01m 00s) [09:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:42] <_joe_> !log testing pre-switchover the step to restart & resync redises in dc_to (eqiad) [09:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:07] !log START - Resync the redis for jobqueues in eqiad with the masters in codfw - t04_resync_redis (switchdc/oblivian@neodymium) [09:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:30] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! [09:58:44] !log END (PASS) - Resync the redis for jobqueues in eqiad with the masters in codfw - t04_resync_redis (switchdc/oblivian@neodymium) [09:58:45] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1493719119136-91325},error:{message:Status check failed (redis failure?)}} - 232 bytes in 0.080 second response time [09:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:01] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: connection error: HTTPConnectionPool(host=localhost, port=8000): Read timed out. 
(read timeout=5) [09:59:06] <_joe_> heh, that was me I guess [09:59:10] ah ok [09:59:10] <_joe_> damn ocg [09:59:17] <_joe_> it should be ok now [09:59:17] your great love [09:59:30] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [09:59:44] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 468 bytes in 0.079 second response time [10:00:06] <_joe_> so this will happen during the switchover as well, btw [10:00:22] <_joe_> unless I switch ocg to use a different redis server [10:00:25] <_joe_> damn [10:01:03] <_joe_> actually, we're doing this all wrong :/ [10:01:20] not having killed OCG already you mean ? [10:01:26] hahaah [10:01:28] <_joe_> that of course is the root cause [10:01:45] <_joe_> akosiaris: do we have room for a VM for redis/ocg? [10:01:52] <_joe_> a small one must suffice [10:02:11] <_joe_> or, I do bring up a couple redises on ocg1002/1003 and be done with it [10:02:17] <_joe_> elukey: ^^ what do you think? [10:02:39] _joe_: we do. are you sure it's a small one ? [10:02:54] although somehow I prefer the "contain everything in ocg approach" [10:02:56] <_joe_> akosiaris: yeah reasonably small [10:03:00] <_joe_> yeah me too [10:03:11] I like the idea [10:03:25] the reason I am asking is that elukey has this idea that this is probably what is killing the redis replication [10:03:27] (03PS1) 10Giuseppe Lavagetto: Small fixes for t04_resync_redis [switchdc] - 10https://gerrit.wikimedia.org/r/351278 [10:03:37] and I am inclined to share that idea [10:03:43] <_joe_> akosiaris: that might be possible, yes [10:03:59] <_joe_> akosiaris: although, if you want to laugh and cry about redis' replication scheme [10:04:08] <_joe_> you can read a ticket I worked on last week [10:04:24] I 've been reading enough on it... [10:04:32] it's quite simplistic of course [10:04:37] yes this one is another issue, the ocg_status hash table with 600k keys.. 
it should fit on the ocg hosts but better to double check :D [10:04:42] <_joe_> no it's utterly broken [10:04:51] <_joe_> elukey: it definitely does [10:05:10] then I am more than happy to have redises on ocg : [10:05:11] :D [10:06:07] <_joe_> akosiaris: https://phabricator.wikimedia.org/T163337#3217044 onwards [10:06:28] <_joe_> the "laugh" part is enjoying my tale of my descent in that rabbithole, I guess [10:09:09] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Small fixes for t04_resync_redis [switchdc] - 10https://gerrit.wikimedia.org/r/351278 (owner: 10Giuseppe Lavagetto) [10:11:24] !log stopping replication on db1015 [10:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:01] !log Upgrading Jenkins to 2.46.1 - T144106 [10:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:10] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106 [10:14:09] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [10:16:49] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:19] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [10:20:05] !log restart ocg on ocg1002 (localhost:8000 - frontend - not reachable) [10:20:09] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 629309 msg: ocg_render_job_queue 0 msg [10:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:23] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011#3227434 (10akosiaris) >>! In T164011#3221336, @Dzahn wrote: > We have lots of room in A2 and A4 and we can move into A4, but we can't move into A2 because the... 
[10:20:29] (03CR) 10Alexandros Kosiaris: [C: 032] Assign new IPs to ganeti2005, ganeti2006 [dns] - 10https://gerrit.wikimedia.org/r/351275 (https://phabricator.wikimedia.org/T164011) (owner: 10Alexandros Kosiaris) [10:20:48] next one: ocg1003 has filled up the /srv/deployment/ocg/output partition [10:21:17] RECOVERY - Host ganeti2005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [10:21:54] (03PS1) 10Volans: MediaWiki: add siteinfo check [switchdc] - 10https://gerrit.wikimedia.org/r/351279 (https://phabricator.wikimedia.org/T163398) [10:22:05] (03PS2) 10Volans: MediaWiki: add siteinfo check [switchdc] - 10https://gerrit.wikimedia.org/r/351279 (https://phabricator.wikimedia.org/T163398) [10:23:14] if it doesn't auto-clean in a bit we'll need to delete things [10:26:11] 06Operations, 06Release-Engineering-Team, 05Goal, 13Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3227446 (10akosiaris) >>! In T162042#3221344, @mobrovac wrote: >>>! In T162042#3220775, @MoritzMuehlenhoff wrote: >> I don't think we ne... [10:29:14] (03PS1) 10Hashar: nodepool: bump python-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/351280 (https://phabricator.wikimedia.org/T144106) [10:32:55] 06Operations, 06Release-Engineering-Team, 05Goal, 13Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3227451 (10MoritzMuehlenhoff) Definitely, zotero can reside in sca* until trusty support runs out in two years. [10:37:22] (03PS1) 10Giuseppe Lavagetto: role::ocg: add local redises [puppet] - 10https://gerrit.wikimedia.org/r/351284 [10:43:40] (03CR) 10Muehlenhoff: [C: 032] nodepool: bump python-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/351280 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [10:44:26] !log create new ganeti nodegroup called row_A holding ganeti2005, ganeti2006. Renamed the default nodegroup to row_B. 
T164011 [10:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:35] T164011: codfw: ganeti2007-ganeti2008 racking and onsite setup task - https://phabricator.wikimedia.org/T164011 [10:47:23] (03PS2) 10Alexandros Kosiaris: ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) [10:48:21] (03CR) 10jerkins-bot: [V: 04-1] ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) (owner: 10Alexandros Kosiaris) [10:48:26] (03PS2) 10Giuseppe Lavagetto: role::ocg: add local redises [puppet] - 10https://gerrit.wikimedia.org/r/351284 [10:48:37] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::ocg: add local redises [puppet] - 10https://gerrit.wikimedia.org/r/351284 (owner: 10Giuseppe Lavagetto) [10:50:10] !log upgrading python-jenkins on labnodepool1001 to 0.4.11 [10:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:10] !log Restarting Nodepool with python-jenkins 0.4.11 [10:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:17] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:54:34] (03PS3) 10Alexandros Kosiaris: ores: Add twemproxy support [puppet] - 10https://gerrit.wikimedia.org/r/350421 (https://phabricator.wikimedia.org/T122676) [10:54:57] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[nutcracker] [10:57:18] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [10:58:20] 06Operations, 10Analytics, 10Traffic: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3227497 (10elukey) [10:59:56] <_joe_> uhm [11:02:13] (03PS1) 10Filippo Giunchedi: swift: default to 127.0.0.1 for memcached [puppet] - 10https://gerrit.wikimedia.org/r/351285 (https://phabricator.wikimedia.org/T162247) [11:03:32] (03PS2) 10Filippo Giunchedi: swift: default to 127.0.0.1 for memcached [puppet] - 10https://gerrit.wikimedia.org/r/351285 (https://phabricator.wikimedia.org/T162247) [11:05:31] (03PS1) 10Phuedx: pagePreviews: Create pp_stage0.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351286 (https://phabricator.wikimedia.org/T162672) [11:05:33] (03PS1) 10Phuedx: pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) [11:05:57] (03CR) 10Phuedx: [C: 04-2] "-2 pending discussion on T162672." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [11:06:09] !log rebooting rdb1002 for kernel update to Linux 4.9 [11:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:26] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Create pp_stage0.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351286 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [11:06:36] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: Deploy to first 50 of stage 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351287 (https://phabricator.wikimedia.org/T162672) (owner: 10Phuedx) [11:07:56] (03PS1) 10Giuseppe Lavagetto: role::ocg: bind redis to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/351289 [11:08:03] (03PS1) 10Hashar: Revert "nodepool: bump python-jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/351290 [11:08:27] (03PS2) 10Hashar: Revert "nodepool: bump python-jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/351290 [11:09:06] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::ocg: bind redis to 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/351289 (owner: 10Giuseppe Lavagetto) [11:10:28] 06Operations, 06Analytics-Kanban, 10DBA, 15User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3227526 (10elukey) [11:10:57] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [11:13:28] (03PS3) 10Muehlenhoff: Revert "nodepool: bump python-jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/351290 (owner: 10Hashar) [11:16:09] PROBLEM - nutcracker port on ocg1003 is CRITICAL: Connection refused [11:17:58] PROBLEM - nutcracker port on ocg1001 is CRITICAL: Connection refused [11:18:38] PROBLEM - nutcracker port on ocg1002 is CRITICAL: Connection refused [11:20:32] (03CR) 10Muehlenhoff: [V: 032 C: 032] Revert "nodepool: bump 
python-jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/351290 (owner: 10Hashar) [11:22:13] listen: 127.0.0.1:6378 ? [11:23:17] !log downgraded python-jenkins on labnodepool1001 to 0.2.1 (0.4.11 is still broken with the new Jenkins LTS) [11:23:21] ah ok the check wants 11212 [11:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:57] !log Restarting Nodepool [11:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:16] (03PS1) 10Elukey: role::ocg: set nutcracker port to 11212 [puppet] - 10https://gerrit.wikimedia.org/r/351292 [11:26:05] (03CR) 10Elukey: [V: 032 C: 032] role::ocg: set nutcracker port to 11212 [puppet] - 10https://gerrit.wikimedia.org/r/351292 (owner: 10Elukey) [11:27:54] (03PS2) 10Gehel: elasticsearch - cleanup hiera lookups with default "undef" [puppet] - 10https://gerrit.wikimedia.org/r/350413 [11:28:08] RECOVERY - nutcracker port on ocg1003 is OK: TCP OK - 0.000 second response time on port 11212 [11:29:16] <_joe_> uhm [11:29:27] <_joe_> the check for nutcracker is clearly wrong there [11:29:56] yes yes but I thought that having the same port everywhere would have been cleaner during debug [11:29:58] RECOVERY - nutcracker port on ocg1001 is OK: TCP OK - 0.000 second response time on port 11212 [11:30:07] we can revert if you want [11:30:25] I thought to do it since it was harmless [11:30:29] !log rebooting rdb1004 for kernel update to Linux 4.9 [11:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:38] RECOVERY - nutcracker port on ocg1002 is OK: TCP OK - 0.000 second response time on port 11212 [11:30:39] <_joe_> elukey: seriously? 
[11:30:40] <_joe_> :P [11:30:47] <_joe_> that's a redis port [11:30:53] <_joe_> not a memcached one [11:31:06] <_joe_> let's fix this properly [11:31:22] it is a proxy port to redis, this is what I see [11:31:49] <_joe_> it is, told you 11212 is not a standard redis port :P [11:31:55] <_joe_> but tbh, ocg [11:31:57] <_joe_> so it's ok [11:33:56] (03PS1) 10Giuseppe Lavagetto: role::ocg: switch to use the local nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/351293 [11:35:24] moritzm: Nodepool/Jenkins etc looks all good now. Thank you! [11:35:24] (03CR) 10Gehel: [C: 032] elasticsearch - cleanup hiera lookups with default "undef" [puppet] - 10https://gerrit.wikimedia.org/r/350413 (owner: 10Gehel) [11:37:03] !log restart of relforge cluster to activate hebrew plugin [11:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:49] (03CR) 10Giuseppe Lavagetto: [C: 032] role::ocg: switch to use the local nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/351293 (owner: 10Giuseppe Lavagetto) [11:38:53] 06Operations, 10Analytics, 10Traffic: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3227497 (10JAllemandou) +1 for that! Thanks @elukey for raising this. 
[11:38:55] (03PS2) 10Giuseppe Lavagetto: role::ocg: switch to use the local nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/351293 [11:39:54] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::ocg: switch to use the local nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/351293 (owner: 10Giuseppe Lavagetto) [11:41:41] (03PS1) 10Volans: Mediawiki: allow to manage ReadOnly and master DC via conftool [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) [11:41:43] (03PS1) 10Volans: MediaWiki tasks: switch to use Conftool based config [switchdc] - 10https://gerrit.wikimedia.org/r/351296 (https://phabricator.wikimedia.org/T163398) [11:43:49] <_joe_> ouch [11:44:05] (03PS1) 10Giuseppe Lavagetto: role::ocg: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/351297 [11:44:22] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::ocg: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/351297 (owner: 10Giuseppe Lavagetto) [11:45:12] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 10 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3227554 (10Volans) thanks @tstarling! [11:47:22] !log rebooting rdb1006 for kernel update to Linux 4.9 [11:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:38] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! [11:47:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! 
[11:47:52] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1493725662227-26612},error:{message:Status check failed (redis failure?)}} - 232 bytes in 0.081 second response time [11:48:12] <_joe_> it should be ok now [11:48:18] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: http status 500 [11:48:20] <_joe_> it's a bit late [11:49:18] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 12472 msg: ocg_render_job_queue 0 msg [11:50:09] <_joe_> so what happened: ocg uses the "info" command from redis, that nutcracker doesn't support [11:50:38] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! [11:50:43] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3227560 (10ayounsi) On a side note, not relying on IPv6 RA, and using static routes/IPs (see T102099) on at least the nodes that use IGMP snooping would work a... [11:51:08] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: http status 500 [11:51:18] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1002.eqiad.wmnet because of too many down! 
[11:52:52] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 467 bytes in 0.077 second response time [11:53:03] <_joe_> sorry for the noise [11:53:14] <_joe_> this time it wasn't a full outage, but still :/ [11:53:18] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 12778 msg: ocg_render_job_queue 0 msg [11:53:18] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [11:53:38] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [11:53:38] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [11:53:39] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [11:54:17] 06Operations, 10netops: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3227569 (10faidon) >>! In T133387#3227560, @ayounsi wrote: > On a side note, not relying on IPv6 RA, and using static routes/IPs (see T102099) on at least the... [12:00:40] (03CR) 10Faidon Liambotis: [C: 04-1] "Typo, per Daniel." 
[puppet] - 10https://gerrit.wikimedia.org/r/350777 (owner: 10Faidon Liambotis) [12:04:02] (03PS1) 10Giuseppe Lavagetto: ocg: use a redis instance, not nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/351299 [12:07:40] (03CR) 10Giuseppe Lavagetto: [C: 032] ocg: use a redis instance, not nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/351299 (owner: 10Giuseppe Lavagetto) [12:09:29] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [12:10:07] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3227606 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'... [12:10:28] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [12:15:22] <_joe_> !log manually set ocg1001,3 to be redis slaves of ocg1002 [12:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:45] (03PS1) 10Giuseppe Lavagetto: ocg1001: put back into rotation [puppet] - 10https://gerrit.wikimedia.org/r/351301 (https://phabricator.wikimedia.org/T161158) [12:18:12] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] ocg1001: put back into rotation [puppet] - 10https://gerrit.wikimedia.org/r/351301 (https://phabricator.wikimedia.org/T161158) (owner: 10Giuseppe Lavagetto) [12:18:17] (03Abandoned) 10Gergő Tisza: Whitelist TSG for account creation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325740 (https://phabricator.wikimedia.org/T152588) (owner: 10Gergő Tisza) [12:19:29] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=pdf,name=ocg1001.eqiad.wmnet [12:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:29] !log rebooting rdb1008 for 
kernel update to Linux 4.9 [12:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:50] (03PS1) 10Alexandros Kosiaris: Assign IPs for ganeti2007, ganeti2008 [dns] - 10https://gerrit.wikimedia.org/r/351303 (https://phabricator.wikimedia.org/T164011) [12:28:29] (03PS1) 10Alexandros Kosiaris: Renumber sca2004 in private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/351304 [12:36:51] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3227649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2020.codfw.wmnet'] ``` and were **ALL** successful. [12:38:41] 06Operations, 06Analytics-Kanban, 10DBA, 15User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3227650 (10elukey) About those 66GB: Piwik uses ~18.5 GB of data, but /var/lib/mysql/ibdata1 is 66GB (no innodb_file_per_table set). [13:03:34] !log rebuild mismounted FSes on ms-be1036 - T163673 [13:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:43] T163673: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673 [13:13:29] !log load testing elastic2020 before putting it back in the cluster - T149006 [13:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:37] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [13:15:56] !log cache_maps: upgrade varnish to 4.1.6-1wm1 [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:10] 06Operations, 10Analytics, 10Traffic: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3227754 (10Ottomata) Why not both!? 
:) [13:26:54] !log stopping load on elastic2020 - T149006 [13:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:03] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [13:27:06] 06Operations, 10Wikimedia-Site-requests: Lost 2FA details, request recovery. - https://phabricator.wikimedia.org/T164265#3227770 (10Zppix) [13:29:10] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3227776 (10Gehel) one of the SSD is in error, waiting for the new one to arrive before running new load tests. [13:34:56] 06Operations, 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3227787 (10Andrew) I ack'ed the alert yesterday and am working on it. It's composer, of course. [13:35:28] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 686727 [13:39:47] !log rebooting rdb1001 for update to latest 4.4 kernel [13:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:08] PROBLEM - Check Varnish expiry mailbox lag on cp2011 is CRITICAL: CRITICAL: expiry mailbox lag is 618822 [13:52:17] (03PS1) 10Volans: cache::text: switch all mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351313 (https://phabricator.wikimedia.org/T160178) [13:52:42] !log rebooting rdb1003 for update to latest 4.4 kernel [13:52:48] (03CR) 10Volans: [C: 04-2] "Waiting for the actual switch" [puppet] - 10https://gerrit.wikimedia.org/r/351313 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:13] hm [14:17:17] (03PS1) 10Volans: discovery::app_routes: switch mediawiki to eqiad [puppet] - 
10https://gerrit.wikimedia.org/r/351315 (https://phabricator.wikimedia.org/T160178) [14:18:02] (03CR) 10Volans: [C: 04-2] "Waiting for the actual switch" [puppet] - 10https://gerrit.wikimedia.org/r/351315 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:22:25] !log rebooting rdb1005 for update to latest 4.4 kernel [14:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] (03CR) 10Giuseppe Lavagetto: [C: 031] MediaWiki: add siteinfo check [switchdc] - 10https://gerrit.wikimedia.org/r/351279 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:25:34] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Mediawiki: allow to manage ReadOnly and master DC via conftool (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:26:48] RECOVERY - are wikitech and wt-static in sync on labtestweb2001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (49721 200000s) [14:26:48] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (49721 200000s) [14:28:31] (03CR) 10Volans: [C: 032] MediaWiki: add siteinfo check [switchdc] - 10https://gerrit.wikimedia.org/r/351279 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:29:43] (03PS2) 10Volans: Mediawiki: allow to manage ReadOnly and master DC via conftool [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) [14:29:50] (03CR) 10Volans: "Fixed" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:33:00] 06Operations, 06Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#3227922 (10Andrew) 05Open>03Resolved I upgraded wikitech to 1.28.2 a few days ago and there was some composer/syntax 
highlighting snafu th... [14:33:50] 06Operations, 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3227925 (10Andrew) The alerting system seems to be working for this. We haven't designated a specific person in charge, but ma... [14:33:53] (03CR) 10Giuseppe Lavagetto: [C: 031] MediaWiki tasks: switch to use Conftool based config (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/351296 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:35:32] !log rebooting rdb1007 for update to latest 4.4 kernel [14:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:59] (03CR) 10Giuseppe Lavagetto: [C: 031] Mediawiki: allow to manage ReadOnly and master DC via conftool [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:38:53] (03CR) 10Volans: "reply inline" (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/351296 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:41:00] (03PS1) 10Alexandros Kosiaris: Remove role:: prefix in hiera_lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/351319 [14:41:32] (03PS1) 10Giuseppe Lavagetto: Etcd: allow reads to happen in the nearest datacenter [dns] - 10https://gerrit.wikimedia.org/r/351320 [14:42:15] 06Operations, 10Analytics, 10Traffic: Add VSL error counters to Varnishkafka stats - https://phabricator.wikimedia.org/T164259#3227966 (10elukey) >>! In T164259#3227754, @Ottomata wrote: > Why not both!? :) Yes! I was concerned that the new field would have been a bit too much, but if we are ok with the new... 
[14:43:38] PROBLEM - Check Varnish expiry mailbox lag on cp2008 is CRITICAL: CRITICAL: expiry mailbox lag is 601664 [14:44:45] (03CR) 10Volans: [C: 032] Mediawiki: allow to manage ReadOnly and master DC via conftool [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:44:49] (03PS3) 10Volans: Mediawiki: allow to manage ReadOnly and master DC via conftool [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) [14:44:51] (03CR) 10jerkins-bot: [V: 04-1] Mediawiki: allow to manage ReadOnly and master DC via conftool [switchdc] - 10https://gerrit.wikimedia.org/r/351295 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:45:53] (03PS2) 10Volans: MediaWiki tasks: switch to use Conftool based config [switchdc] - 10https://gerrit.wikimedia.org/r/351296 (https://phabricator.wikimedia.org/T163398) [14:46:28] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 33 probes of 445 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [14:48:41] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 54 probes of 431 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:52:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 41 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:56:40] (03CR) 10Volans: [C: 032] MediaWiki tasks: switch to use Conftool based config [switchdc] - 10https://gerrit.wikimedia.org/r/351296 (https://phabricator.wikimedia.org/T163398) (owner: 10Volans) [14:57:53] (03PS1) 10Giuseppe Lavagetto: profile::etcd::replication: make replication errors page [puppet] - 10https://gerrit.wikimedia.org/r/351323 [15:01:19] !log stop and masked memcached on mc10[01-18].eqiad.wmnet [15:01:21] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 4 probes of 445 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791307/#!map [15:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 13 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:03:41] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 431 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [15:03:41] RECOVERY - Check Varnish expiry mailbox lag on cp2008 is OK: OK: expiry mailbox lag is 0 [15:07:30] (03CR) 10Alexandros Kosiaris: [C: 031] Etcd: allow reads to happen in the nearest datacenter [dns] - 10https://gerrit.wikimedia.org/r/351320 (owner: 10Giuseppe Lavagetto) [15:10:40] (03CR) 10Alexandros Kosiaris: [C: 031] profile::etcd::replication: make replication errors page [puppet] - 10https://gerrit.wikimedia.org/r/351323 (owner: 10Giuseppe Lavagetto) [15:14:26] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3228091 (10Papaul) @Robh any update on this? [15:14:35] (03CR) 10Giuseppe Lavagetto: [C: 032] Etcd: allow reads to happen in the nearest datacenter [dns] - 10https://gerrit.wikimedia.org/r/351320 (owner: 10Giuseppe Lavagetto) [15:18:11] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::replication: make replication errors page [puppet] - 10https://gerrit.wikimedia.org/r/351323 (owner: 10Giuseppe Lavagetto) [15:18:42] RECOVERY - Disk space on graphite1003 is OK: DISK OK [15:20:02] !log add 100G to graphite1003 and graphite2002 [15:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:31] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3228120 (10Papaul) After replacing the main board. at first book HP ILO detected that one of the SSD's was bad. 
After a couple of reboots the e... [15:24:47] (03CR) 10Alexandros Kosiaris: [C: 032] Remove role:: prefix in hiera_lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/351319 (owner: 10Alexandros Kosiaris) [15:24:51] (03PS2) 10Alexandros Kosiaris: Remove role:: prefix in hiera_lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/351319 [15:24:52] <_joe_> !log restarting confd in eqiad/esams to pick up the server change [15:24:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove role:: prefix in hiera_lookup tool [puppet] - 10https://gerrit.wikimedia.org/r/351319 (owner: 10Alexandros Kosiaris) [15:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:00] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3228127 (10RobH) Nope, I forgot about it! I'll knock them out now. [15:36:04] !log cache_misc: upgrade varnish to 4.1.6-1wm1 [15:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:11] (03PS1) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [15:43:22] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3228138 (10RobH) row c done [15:44:34] (03CR) 10jerkins-bot: [V: 04-1] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [15:53:57] !log oblivian@puppetmaster1001 conftool action : set/@read-only.yaml; selector: name=ReadOnly,scope=eqiad [15:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:57] (03PS2) 10Elukey: Replace mc100[123] with mc10(19|2[01]) after hw refresh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345) [15:55:15] _joe_ --^ (if you have time) [15:56:14] <_joe_> elukey: why not one per row? 
[15:56:23] <_joe_> that's pretty unfortunate :) [15:56:58] I completely forgot to check that, good that you asked :) [15:57:01] fixing it [15:57:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Replace mc100[123] with mc10(19|2[01]) after hw refresh (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [15:58:59] mc[1019-1023].eqiad.wmnet are all in row a of course :D [15:59:33] (03PS1) 10BBlack: Revert "traffic: depool eqiad from user traffic" [dns] - 10https://gerrit.wikimedia.org/r/351330 [15:59:38] (03PS2) 10BBlack: Revert "traffic: depool eqiad from user traffic" [dns] - 10https://gerrit.wikimedia.org/r/351330 [16:02:12] jouncebot: next [16:02:12] In 140 hour(s) and 57 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300) [16:02:35] ah that's right, no deployments this week [16:03:26] (03PS3) 10Elukey: Replace mc100[123] with mc10(19|2[01]) after hw refresh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345) [16:03:46] https://media2.giphy.com/media/3o7abldj0b3rxrZUxW/giphy.mp4 [16:03:51] (03PS2) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [16:04:41] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3228214 (10RobH) [16:04:44] 06Operations, 10ops-codfw, 10DBA, 10netops: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3228212 (10RobH) 05Open>03Resolved row d done [16:05:26] (03CR) 10Giuseppe Lavagetto: [C: 031] Replace mc100[123] with mc10(19|2[01]) after hw refresh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:05:48] (03PS4) 10Elukey: Replace Redis lock IPs (mc100[123]) after hw refresh 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345) [16:07:45] (03CR) 10Elukey: [C: 032] Replace Redis lock IPs (mc100[123]) after hw refresh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:07:56] (03CR) 10jenkins-bot: Replace Redis lock IPs (mc100[123]) after hw refresh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351254 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:08:12] going to deploy the Redis lock change now [16:08:23] volans: --^ [16:09:20] elukey: ack [16:12:00] greg-g: Can I get an emergency deploy slot for a UBN in VE? T164157 – we broke CAPTCHA support. :-( [16:12:01] T164157: [Regression] ‘Empty server response’ on saving through VE when CAPTCHA issued (API response terminated?) - https://phabricator.wikimedia.org/T164157 [16:14:30] James_F: probably, asking Ops re timing now [16:14:35] Thanks. [16:16:16] !log elukey@naos Synchronized wmf-config/ProductionServices.php: Replace Redis lock IPs after hw refresh (duration: 01m 16s) [16:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:13] James_F: consensus is go ahead [16:22:28] greg-g: Thanks. RoanKattouw, please go. [16:24:37] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3228281 (10elukey) 05Open>03Resolved Today Joe added some Redis instance to the ocg hosts to decouple it from the job queues, hopefully we shouldn't see this err... 
[16:25:22] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3228284 (10elukey) [16:25:36] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3228285 (10elukey) [16:25:57] !log OS install on new db servers [16:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:24] (03CR) 10RobH: [C: 031] Remove mgmt dns records for mw2090->mw2096 [dns] - 10https://gerrit.wikimedia.org/r/350813 (https://phabricator.wikimedia.org/T161488) (owner: 10Elukey) [16:29:27] 06Operations, 13Patch-For-Review, 15User-Elukey: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#3228307 (10elukey) 05Open>03Resolved [16:29:38] <_joe_> !log testing (not dry-run) cache wipe/warmup and redis resync for the switchover codfw->eqiad [16:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:49] !log START - Wipe and warmup caches in codfw - t04_cache_wipe (switchdc/oblivian@neodymium) [16:29:54] !log START - Resync the redis for jobqueues in eqiad with the masters in codfw - t04_resync_redis (switchdc/oblivian@neodymium) [16:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:31] <_joe_> !log message about cache warmup is wrong, it is being executed in eqiad [16:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:41] PROBLEM - Check health of redis instance on 6379 on rdb1001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [16:32:42] !log END (PASS) - Resync the redis for jobqueues in eqiad with the masters in codfw - t04_resync_redis (switchdc/oblivian@neodymium) [16:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:33:41] RECOVERY - Check health of redis instance on 6379 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8480966 keys, up 3 minutes 42 seconds - replication_delay is 0 [16:34:33] 06Operations: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3228333 (10Cmjohnson) The disk has been replaced, can someone rebuild the raid please [16:35:17] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T164202#3228351 (10Cmjohnson) [16:35:21] 06Operations: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3228353 (10Cmjohnson) [16:35:54] (03PS1) 10Volans: t04: fix title [switchdc] - 10https://gerrit.wikimedia.org/r/351335 (https://phabricator.wikimedia.org/T160178) [16:36:00] !log END (PASS) - Wipe and warmup caches in codfw - t04_cache_wipe (switchdc/oblivian@neodymium) [16:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:06] !log ppchelko@naos Started deploy [restbase/deploy@6adb0f2]: Summary endpoint enhancements [16:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:00] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3228380 (10Eevans) [16:38:43] hello! could i get some help from the person on clinic duty today? 
[16:40:29] (03CR) 10Volans: [C: 032] t04: fix title [switchdc] - 10https://gerrit.wikimedia.org/r/351335 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [16:40:39] (03CR) 10Giuseppe Lavagetto: [V: 032] t04: fix title [switchdc] - 10https://gerrit.wikimedia.org/r/351335 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [16:40:46] <_joe_> merge merge [16:40:48] <_joe_> :P [16:40:54] done :D [16:42:03] apergos: FYI (few lines above) ^^^ [16:42:11] RECOVERY - Check Varnish expiry mailbox lag on cp2011 is OK: OK: expiry mailbox lag is 33069 [16:42:15] ah [16:42:20] nuria_: what can I do for you? [16:42:41] apergos: thank you! elukey is taking care of it [16:42:50] okey dokey [16:42:51] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=333.10 Read Requests/Sec=350.60 Write Requests/Sec=3.70 KBytes Read/Sec=44125.20 KBytes_Written/Sec=81.60 [16:42:54] !log ppchelko@naos Finished deploy [restbase/deploy@6adb0f2]: Summary endpoint enhancements (duration: 05m 47s) [16:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:11] !log ppchelko@naos Started deploy [restbase/deploy@6adb0f2]: Summary endpoint enhancements. Restart after a check fail [16:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:43] (03PS3) 10Filippo Giunchedi: swift: default to 127.0.0.1 for memcached [puppet] - 10https://gerrit.wikimedia.org/r/351285 (https://phabricator.wikimedia.org/T162247) [16:46:52] James_F: greg-g: Sorry for the delay, going to start deploying now [16:47:11] !log testing (not dry-run) tasks for tomorrow's switchover in reverse mode eqiad->codfw [16:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:16] !log ppchelko@naos Started deploy [restbase/deploy@6adb0f2]: Summary endpoint enhancements. 
Restart after a check timeout [16:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:11] !log START - Reduce the TTL of all the MediaWiki read-write discovery records - t00_reduce_ttl (switchdc/volans@neodymium) [16:50:12] !log END (FAIL) - Reduce the TTL of all the MediaWiki read-write discovery records - t00_reduce_ttl (switchdc/volans@neodymium) [16:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:22] (03CR) 10Filippo Giunchedi: [C: 032] "noop in production, PCC https://puppet-compiler.wmflabs.org/6268/" [puppet] - 10https://gerrit.wikimedia.org/r/351285 (https://phabricator.wikimedia.org/T162247) (owner: 10Filippo Giunchedi) [16:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:50] !log START - Reduce the TTL of all the MediaWiki read-write discovery records - t00_reduce_ttl (switchdc/volans@neodymium) [16:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:01] !log END (PASS) - Reduce the TTL of all the MediaWiki read-write discovery records - t00_reduce_ttl (switchdc/volans@neodymium) [16:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:41] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:51:56] !log START - Disabling puppet on selected hosts in eqiad and codfw - t00_disable_puppet (switchdc/volans@neodymium) [16:52:02] !log END (PASS) - Disabling puppet on selected hosts in eqiad and codfw - t00_disable_puppet (switchdc/volans@neodymium) [16:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:51] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.80 Read Requests/Sec=0.40 Write Requests/Sec=1.80 KBytes Read/Sec=2.00 KBytes_Written/Sec=48.40 [16:53:07] !log START - Stop MediaWiki jobrunners, videoscalers and cronjobs in eqiad - t01_stop_maintenance (switchdc/volans@neodymium) [16:53:11] !log END (FAIL) - Stop MediaWiki jobrunners, videoscalers and cronjobs in eqiad - t01_stop_maintenance (switchdc/volans@neodymium) [16:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:25] <_joe_> that's expected [16:54:50] I see a lot of red stuff popping up in icinga [16:55:06] rb/mobileapps/scb all at 1-2/3 building up to alerts [16:55:31] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:56:01] a lot of it self-cleared before reaching 3/3, only a little reached icinga-wm reporting here [16:56:07] <_joe_> I think it's restbase in icinga [16:56:12] !log ppchelko@naos Finished deploy [restbase/deploy@6adb0f2]: Summary endpoint enhancements. 
Restart after a check timeout (duration: 07m 56s) [16:56:13] <_joe_> in eqiad sorry [16:56:17] <_joe_> during the deploy [16:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:36] !log START - Set MediaWiki in read-only mode in eqiad - t02_start_mediawiki_readonly (switchdc/volans@neodymium) [16:57:38] !log MediaWiki read-only period starts at: 2017-05-02 16:57:37.952132 (switchdc/volans@neodymium) [16:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:13] !log END (FAIL) - Set MediaWiki in read-only mode in eqiad - t02_start_mediawiki_readonly (switchdc/volans@neodymium) [16:58:13] Is it safe for me to deploy a MediaWiki patch now, or should I wait? [16:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:32] Greg approved it ~40 mins ago but I needed to emerge from a meeting [16:58:38] ^^^ expected, we're not RO right now of course [16:59:11] RoanKattouw: in the middle of the switchdc testing, but if it's urgent I guess we can hold on [17:00:14] <_joe_> volans: I think we can safely test the next few steps tbh [17:00:23] <_joe_> volans: we don't do scaps anymore, so... [17:00:31] volans: I can wait for you to finish, how long will that take? 
[17:00:32] yeah :D [17:00:42] RoanKattouw: go ahead [17:00:59] if anything can conflict we hold on that part until you finish [17:03:01] !log START - Set core DB masters in read-only mode in eqiad, ensure all masters are read-only - t03_coredb_masters_readonly (switchdc/volans@neodymium) [17:03:04] !log END (FAIL) - Set core DB masters in read-only mode in eqiad, ensure all masters are read-only - t03_coredb_masters_readonly (switchdc/volans@neodymium) [17:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:23] ^^^ expected, same, we're not RO right now of course [17:03:36] <_joe_> yeah it thankfully fails to verify :P [17:04:29] OK, syncing now [17:04:31] Thanks [17:04:58] thanks [17:05:19] !log catrope@naos Synchronized php-1.29.0-wmf.21/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTarget.js: T164157 (duration: 01m 00s) [17:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:28] T164157: [Regression] ‘Empty server response’ on saving through VE when CAPTCHA issued (API response terminated?) - https://phabricator.wikimedia.org/T164157 [17:06:56] All done [17:07:12] RoanKattouw: thanks [17:07:21] just this one? [17:07:27] <_joe_> yup [17:07:32] ok [17:07:37] let's continue [17:07:48] <_joe_> yes [17:08:17] !log START - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - t05_switch_datacenter (switchdc/volans@neodymium) [17:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:53] !log END (FAIL) - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - t05_switch_datacenter (switchdc/volans@neodymium) [17:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:21] is it normal to fail? [17:11:29] because testing [17:11:32] ? 
[17:11:33] yes, one mw host broken [17:11:37] :( [17:11:39] oh [17:11:41] so not normal [17:12:35] <_joe_> jynus: we'll retry in a sec [17:13:38] (03PS1) 10Ottomata: Add Victoria Coleman as ldap_only_user so we can grant access to Pivot [puppet] - 10https://gerrit.wikimedia.org/r/351342 (https://phabricator.wikimedia.org/T164278) [17:14:01] (03PS1) 10Muehlenhoff: Record new MOU expiry date for Bob West [puppet] - 10https://gerrit.wikimedia.org/r/351343 [17:15:04] (03CR) 10Muehlenhoff: [C: 031] Add Victoria Coleman as ldap_only_user so we can grant access to Pivot [puppet] - 10https://gerrit.wikimedia.org/r/351342 (https://phabricator.wikimedia.org/T164278) (owner: 10Ottomata) [17:15:16] 2 bots but each talks about different changes.. hmm [17:15:18] (03CR) 10Muehlenhoff: [C: 032] Record new MOU expiry date for Bob West [puppet] - 10https://gerrit.wikimedia.org/r/351343 (owner: 10Muehlenhoff) [17:15:56] (03PS3) 10Madhuvishy: sge: Add gridengine-client package dependency to grid master and shadow-master [puppet] - 10https://gerrit.wikimedia.org/r/351214 (https://phabricator.wikimedia.org/T162955) [17:15:58] (03CR) 10Andrew Bogott: [C: 031] sge: Add gridengine-client package dependency to grid master and shadow-master [puppet] - 10https://gerrit.wikimedia.org/r/351214 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy) [17:16:58] (03PS2) 10Ottomata: Add Victoria Coleman as ldap_only_user so we can grant access to Pivot [puppet] - 10https://gerrit.wikimedia.org/r/351342 (https://phabricator.wikimedia.org/T164278) [17:17:00] thanks moritzm [17:17:04] (03CR) 10Ottomata: [V: 032 C: 032] Add Victoria Coleman as ldap_only_user so we can grant access to Pivot [puppet] - 10https://gerrit.wikimedia.org/r/351342 (https://phabricator.wikimedia.org/T164278) (owner: 10Ottomata) [17:17:42] !log START - Switch traffic flow to the appservers from eqiad to codfw - t05_switch_traffic (switchdc/volans@neodymium) [17:17:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:38] (03CR) 10Madhuvishy: [C: 032] sge: Add gridengine-client package dependency to grid master and shadow-master [puppet] - 10https://gerrit.wikimedia.org/r/351214 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy) [17:18:50] (03PS4) 10Madhuvishy: sge: Add gridengine-client package dependency to grid master and shadow-master [puppet] - 10https://gerrit.wikimedia.org/r/351214 (https://phabricator.wikimedia.org/T162955) [17:18:54] (03CR) 10Madhuvishy: [V: 032 C: 032] sge: Add gridengine-client package dependency to grid master and shadow-master [puppet] - 10https://gerrit.wikimedia.org/r/351214 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy) [17:20:21] PROBLEM - Nginx local proxy to apache on mw2256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1883 bytes in 9.912 second response time [17:20:21] PROBLEM - Apache HTTP on mw2256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1883 bytes in 1.259 second response time [17:20:41] PROBLEM - HHVM rendering on mw2256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1881 bytes in 0.030 second response time [17:20:43] !log END (PASS) - Switch traffic flow to the appservers from eqiad to codfw - t05_switch_traffic (switchdc/volans@neodymium) [17:20:43] <_joe_> that's me ^^ [17:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:11] RECOVERY - Nginx local proxy to apache on mw2256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.090 second response time [17:22:21] RECOVERY - Apache HTTP on mw2256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.102 second response time [17:22:41] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 73528 bytes in 0.128 second response time [17:23:26] !log START - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - 
t05_switch_datacenter (switchdc/volans@neodymium) [17:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:37] !log END (FAIL) - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - t05_switch_datacenter (switchdc/volans@neodymium) [17:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:03] * volans checking [17:25:17] BTW, there is no eqiad -> codfw replication at the moment, not sure if that can make things fail [17:25:31] <_joe_> jynus: no [17:26:24] I will put it back tomorrow morning, and is only needed in case of a failed switch [17:28:07] (03PS1) 10Volans: t05_switch_datacenter: fix typo in DNS checks [switchdc] - 10https://gerrit.wikimedia.org/r/351346 (https://phabricator.wikimedia.org/T160178) [17:30:00] (03CR) 10Volans: [V: 032 C: 032] t05_switch_datacenter: fix typo in DNS checks [switchdc] - 10https://gerrit.wikimedia.org/r/351346 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [17:31:02] !log START - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - t05_switch_datacenter (switchdc/volans@neodymium) [17:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:11] !log END (PASS) - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - t05_switch_datacenter (switchdc/volans@neodymium) [17:31:18] finally :D [17:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:00] !log START - Switch the Redis masters from eqiad to codfw and invert the replication - t06_redis (switchdc/volans@neodymium) [17:32:04] !log END (PASS) - Switch the Redis masters from eqiad to codfw and invert the replication - t06_redis (switchdc/volans@neodymium) [17:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:31] !log START - Set core DB masters in read-write mode in codfw, ensure masters in eqiad are read-only - t07_coredb_masters_readwrite (switchdc/volans@neodymium) [17:33:35] !log END (PASS) - Set core DB masters in read-write mode in codfw, ensure masters in eqiad are read-only - t07_coredb_masters_readwrite (switchdc/volans@neodymium) [17:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:24] !log oblivian@puppetmaster1001 conftool action : set/val=test; selector: name=ReadOnly,scope=codfw [17:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:45] !log START - Set MediaWiki in read-write mode in codfw - t08_stop_mediawiki_readonly (switchdc/volans@neodymium) [17:35:48] !log MediaWiki read-only period ends at: 2017-05-02 17:35:48.111079 (switchdc/volans@neodymium) [17:35:49] !log END (PASS) - Set MediaWiki in read-write mode in codfw - t08_stop_mediawiki_readonly (switchdc/volans@neodymium) [17:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:03] !log START - Restore the TTL of all the MediaWiki read-write discovery records and cleanup confd stale files - t09_restore_ttl (switchdc/volans@neodymium) [17:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:12] !log END (PASS) - Restore the TTL of all the MediaWiki read-write discovery records and cleanup confd stale files - t09_restore_ttl (switchdc/volans@neodymium) [17:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:34] !log mobrovac@naos Started deploy [restbase/deploy@6adb0f2]: (no justification provided) 
[17:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:02] !log START - Start MediaWiki jobrunners, videoscalers and maintenance in codfw - t09_start_maintenance (switchdc/volans@neodymium) [17:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:08] !log mobrovac@naos Finished deploy [restbase/deploy@6adb0f2]: (no justification provided) (duration: 01m 34s) [17:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:02] !log END (PASS) - Start MediaWiki jobrunners, videoscalers and maintenance in codfw - t09_start_maintenance (switchdc/volans@neodymium) [17:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:13] (03CR) 10Jdlrobson: "I thought that 'en' means any English project e..g enwikivoyage/enwikipedia etc which is why I left them in place in PS1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: 10Jdlrobson) [17:41:54] (03PS2) 10Jdlrobson: Correction to config definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) [17:44:40] !log mobrovac@naos Started deploy [restbase/deploy@6adb0f2]: Include displaytitle and page_id in the summary output and bump the content type version - T163729 T164079 [17:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:49] T163729: Summaries should include display title - https://phabricator.wikimedia.org/T163729 [17:44:49] T164079: Summaries should include page id - https://phabricator.wikimedia.org/T164079 [17:48:56] !log new db servers signing puppet certs,salt-key, initial run [17:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:14] (03CR) 10BBlack: [C: 032] Revert "traffic: depool eqiad from user traffic" [dns] - 10https://gerrit.wikimedia.org/r/351330 (owner: 10BBlack) [17:50:44] !log 
mobrovac@naos Finished deploy [restbase/deploy@6adb0f2]: Include displaytitle and page_id in the summary output and bump the content type version - T163729 T164079 (duration: 06m 04s) [17:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:53] T163729: Summaries should include display title - https://phabricator.wikimedia.org/T163729 [17:50:54] T164079: Summaries should include page id - https://phabricator.wikimedia.org/T164079 [17:51:27] !log codfw->eqiad switchback: end-user edge traffic back to normal @ eqiad ( https://gerrit.wikimedia.org/r/#/c/351330/ ) - 10 minute TTL for bulk traffic pattern shift starts now. [17:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:31] PROBLEM - Check Varnish expiry mailbox lag on cp3045 is CRITICAL: CRITICAL: expiry mailbox lag is 564036 [18:21:34] PROBLEM - MariaDB Slave Lag: s3 on db1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 29393.05 seconds [18:21:57] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:22:08] PROBLEM - Elasticsearch HTTPS on elastic2020 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2020.codfw.wmnet [18:23:19] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /srv/deployment/ocg/output 9488 MB (3% inode=98%) [18:23:55] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] [18:24:04] PROBLEM - Check systemd state on labsdb1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:24:26] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.97, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f5c80655950: Failed to establish a new connection: [Errno 111] Connection refused,)) [18:24:36] PROBLEM - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 [18:24:36] ACKNOWLEDGEMENT - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T164287 [18:24:40] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T164287#3228743 (10ops-monitoring-bot) [18:24:45] PROBLEM - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:24:46] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] [18:24:57] PROBLEM - Restbase root url on restbase1018 is CRITICAL: connect to address 10.64.48.97 and port 7231: Connection refused [18:25:14] PROBLEM - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused [18:25:34] PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:25:35] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[cni] [18:25:44] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:26:04] PROBLEM - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused [18:26:14] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl] [18:26:14] PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:26:24] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdc] [18:26:24] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [18:26:24] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 108, down: 1, dormant: 0, excluded: 3, unused: 0BRge-11/0/2: down - frdb1002BR [18:26:34] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] [18:26:34] PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused [18:26:54] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:27:04] PROBLEM - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group [18:27:04] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:27:04] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [18:28:44] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 144 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 [18:29:24] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 144 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 [18:30:04] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 connecting: cp3009_v4, cp3009_v6 not-conn: cp3003_v4, cp3003_v6 [18:34:04] PROBLEM - HP RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [18:34:34] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:37:35] icinga downtimes were deleted *again* [18:37:40] did someone take restbase1018 out of maintenance? [18:37:46] icinga did [18:38:03] jynus: ummm :) [18:38:23] jynus: please tell me it hasn't become self-aware [18:38:27] T164206 [18:38:27] T164206: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206 [18:38:31] auh [18:38:49] 06Operations, 10Icinga, 10Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3228791 (10jcrespo) This just happened again for the nth time. [18:38:50] jynus: thanks [18:38:55] jynus: I'm at dinner but look at einsteinium/tegmen crontabs, the one that syncs the state file between the two [18:39:20] there is a restart happening [18:39:26] that triggers it [18:41:19] Scheduling refresh of Service[icinga], but that is normal (reload) [18:42:46] segfault on libc [18:42:48] strange [18:44:27] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3228800 (10Papaul) db2084 can not boot to PXE. I am troubleshooting it. 
[18:44:38] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:45:04] tegmen icinga: Caught SIGHUP, restarting.. [18:45:18] Icinga 1.11.6 starting... [18:45:29] ok that's the reason [18:46:20] urandom, do you remember which services were in downtime? [18:46:41] jynus: just the one i put there yesterday [18:46:52] restbase1018, for example? [18:46:57] yeah [18:47:00] all services or some in particular? [18:47:08] RECOVERY - Check Varnish expiry mailbox lag on cp3045 is OK: OK: expiry mailbox lag is 263 [18:47:11] most; i just re-added them [18:47:17] i think it was all of them [18:47:25] I am checking for a pattern [18:47:43] jynus: we could use the file from einsteinium if it was not already re-written [18:47:50] by the crontab with the bogus one [18:47:55] it doesn't matter [18:48:00] it is not a huge loss [18:48:07] but it is very annoying [18:48:25] maybe it is syncing old versions? [18:52:08] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [18:52:59] What I can do is restart it manually and see if it happens again [18:53:25] I will annoy urandom but it will give us more information [18:54:45] heh [18:54:49] jynus: go for it [18:54:57] jynus: annoy at will [18:55:08] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:58:38] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:03:11] volans, it is syncing from tegmen to einsteinium, but I see no reason for tegmen files to be overwritten [19:03:41] jynus: sorry just got back from dinner [19:04:31] also the downtimes have been there for days or hours, so it cannot be some weird race condition [19:07:25] random theory: you said icinga crashed; if it crashed while rewriting the state file, and it does that as a non-atomic operation, then at restart the state file is missing/truncated, hence the missing data [19:08:29] crashed? [19:08:36] no, I do not know that [19:08:43] I said it restarted [19:08:59] ah ok, then scratch that [19:09:14] there was a segfault on libc [19:09:20] but on a separate binary [19:09:39] oh, wait [19:09:45] I can see a lot of "SIGHUP, restarting" when puppet runs [19:09:51] "rsync on icinga-tmpfs/status.dat from einsteinium.wikimedia.org" [19:10:06] mmm [19:10:12] where? [19:10:17] no, that "from" probably means in the right direction [19:10:37] executed from, not synced from [19:11:08] yeah, no such logs on einsteinium [19:11:24] I could think of something weird but logical [19:11:44] like puppet is disabled on sync on the rsync client [19:12:06] but if rsync and puppet (restart) happen at the same time on client and server [19:12:11] we get something weird [19:12:26] unlikely [19:12:39] but it is the only thing I got now [19:12:52] latest sync happened at 18:33:19 [19:13:35] 33 * * * * /usr/local/sbin/run-no-puppet /usr/local/sbin/sync_icinga_state >/dev/null 2>&1 [19:13:40] on einsteinium [19:13:54] yes, that is expected [19:14:10] I am looking at puppet activity on tegmen at that time [19:15:02] but the first alarms were a bit before, weren't they?
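Editorial note on the non-atomic-write theory raised at 19:07:25: it was set aside once it was clear Icinga had restarted rather than crashed, but the failure mode it describes is a real one in general. A minimal Python sketch of the usual guard follows, assuming a hypothetical writer that owns the state file. This is not Icinga's code; `write_state_atomically` and the file name used with it are illustrative only.

```python
import os
import tempfile


def write_state_atomically(path: str, data: str) -> None:
    """Replace the file at `path` so readers see the old or the new content, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap on POSIX filesystems
    except BaseException:
        # On any failure, drop the partial temp file; the previous state file is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern a crash in the middle of a write leaves the previous file in place, so a restart would load stale data rather than an empty or truncated file.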
[19:15:28] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 206623 [19:15:49] I am refining the times [19:16:33] (PS1) Chad: Scap prep: Save network time by copying data locally [mediawiki-config] - https://gerrit.wikimedia.org/r/351356 [19:16:40] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:17:10] volans jynus looks related https://github.com/Icinga/icinga-core/issues/1006 (but not really sure) [19:17:13] yeah, alert at 18:21:34 [19:17:42] restart 10 seconds before [19:18:16] and puppet runs at 19 + splay [19:18:28] yeah, it was finishing at the time [19:18:34] so I do not see the sync as the cause [19:18:42] already? or was the reload from puppet triggering it? [19:18:45] I will now look at what puppet was doing [19:19:44] caching catalog at 18:20:28 [19:20:05] a service or host change [19:20:05] May 2 18:21:11 tegmen puppet-agent[49918]: (/Stage[main]/Icinga/Service[icinga]) Triggered 'refresh' from 2 events [19:20:10] May 2 18:21:11 tegmen icinga: Caught SIGHUP, restarting... [19:20:15] and yes, that [19:20:29] but that is expected [19:20:31] but the restart seems common [19:20:46] although I thought we were reloading icinga, not restarting it [19:20:50] except it seems to be the cause [19:21:00] but maybe because of systemd we restart? [19:21:05] yeah, people were saying the same [19:21:27] maybe we do not have such frequent changes and that is actually what happens [19:21:39] it is not random downtime drops [19:21:51] they are dropped every time it restarts with a new config [19:23:07] I am going to add what it is not [19:23:58] maybe when we add hosts?
[19:24:19] looking at puppet log [19:24:46] the 3rd last puppet run had a diff [19:25:01] with +define host [19:25:04] I can look at the archives [19:25:05] Operations, Icinga, Monitoring: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3229063 (jcrespo) We discarded sync-related, this happened at 18:21 and the sync happens much later, at 18:33 (and in the right direction, tegmen -> einstenium).... [19:25:19] for the new dbs [19:25:21] last time it happened was May 1, and I have the timestamp [19:26:15] (PS1) Anomie: Add an additional SSH key for anomie [puppet] - https://gerrit.wikimedia.org/r/351362 [19:26:23] but we don't have puppet logs [19:26:30] we do [19:26:41] I call them syslogs :-) [19:26:52] yes but without puppet output IIRC [19:27:05] are you sure? [19:27:40] the SIGHUP is normal [19:27:49] it happens all the time [19:28:26] sure about what? [19:28:34] that we do not have puppet logs [19:28:58] puppet outputs are in puppet.log [19:29:05] but there are only N runs AFAIK [19:29:11] so May 1 15:21:06 there is an icinga config error [19:29:27] I am only looking at syslog right now [19:29:49] which has puppet too, AFAICS [19:31:01] it shows you the puppet runs, but not the puppet output with the diffs of the icinga files [19:31:13] ok, I get you now [19:31:19] probably [19:31:48] but that may be enough, I can see puppet_hosts.cfg was changed [19:31:54] ok [19:32:10] now I have to demonstrate the opposite [19:32:22] that it doesn't happen when those are not touched [19:33:32] eheheh [19:33:35] wanna help? [19:33:41] nah, I see changes there all the time [19:33:53] and actually, I can see the diffs [19:34:27] well, it is a diff, but maybe not a totally new host [19:35:49] "icinga: Auto-save of retention data completed successfully."
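Editorial note: the timeline check done by hand in the channel (alert at 18:21:34, SIGHUP restart logged at 18:21:11, state sync at 18:33:19) can be sketched as a small script. This is a hypothetical helper, not an existing tool. The two sample lines are the syslog lines quoted in the channel; the year (syslog lines carry none) and the two-minute window are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Syslog lines as pasted in the channel at 19:20.
SAMPLE_SYSLOG = """\
May 2 18:21:11 tegmen puppet-agent[49918]: (/Stage[main]/Icinga/Service[icinga]) Triggered 'refresh' from 2 events
May 2 18:21:11 tegmen icinga: Caught SIGHUP, restarting...
"""


def restart_times(syslog_text, year=2017):
    """Return the timestamps of every Icinga SIGHUP restart found in the text."""
    times = []
    for line in syslog_text.splitlines():
        if "icinga: Caught SIGHUP" not in line:
            continue
        # Classic syslog prefix: month, day, clock; the year is supplied by the caller.
        month, day, clock = line.split(None, 3)[:3]
        times.append(datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"))
    return times


def restart_before(alert_time, restarts, window=timedelta(minutes=2)):
    """Return the latest restart within `window` before the alert, or None."""
    candidates = [t for t in restarts if timedelta(0) <= alert_time - t <= window]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    alert = datetime(2017, 5, 2, 18, 21, 34)  # first alert quoted in the channel
    print(restart_before(alert, restart_times(SAMPLE_SYSLOG)))
```

Running the same check against the May 1 incident would need that day's syslog; a match there would support the "dropped on restart with a new config" reading, though it would not prove it.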
[19:36:19] but again, that happens every time [19:36:36] (03PS2) 10Thcipriani: WIP: scap: Add a scap::master profile [puppet] - 10https://gerrit.wikimedia.org/r/351179 [19:41:25] (03PS3) 10Thcipriani: WIP: scap: Add a scap::master profile [puppet] - 10https://gerrit.wikimedia.org/r/351179 [19:44:55] (03PS1) 10Thcipriani: l10nupdate: don't run during deployment freeze [puppet] - 10https://gerrit.wikimedia.org/r/351365 [19:49:09] (03PS1) 10Ayounsi: Add more frack hosts to Smokeping [puppet] - 10https://gerrit.wikimedia.org/r/351367 [19:56:41] (03CR) 10Thcipriani: "Puppet compiler does what I'd expect: https://puppet-compiler.wmflabs.org/6270/" [puppet] - 10https://gerrit.wikimedia.org/r/351179 (owner: 10Thcipriani) [19:56:44] (03CR) 10Ayounsi: [C: 032] Add more frack hosts to Smokeping [puppet] - 10https://gerrit.wikimedia.org/r/351367 (owner: 10Ayounsi) [19:58:52] (03PS1) 10Dzahn: admin: add new ed25519 ssh key for myself (dzahn) [puppet] - 10https://gerrit.wikimedia.org/r/351368 [20:04:57] !log Restarting Jenkins for plugin rollback [20:04:57] (03CR) 10Dzahn: [C: 032] "hey, it's me. i GPG signed the commit message :)" [puppet] - 10https://gerrit.wikimedia.org/r/351368 (owner: 10Dzahn) [20:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:18] (03PS2) 10Dzahn: admin: add new ed25519 ssh key for myself (dzahn) [puppet] - 10https://gerrit.wikimedia.org/r/351368 [20:09:15] (03PS1) 10Madhuvishy: sge: Fix global config handling [puppet] - 10https://gerrit.wikimedia.org/r/351379 (https://phabricator.wikimedia.org/T162955) [20:10:13] (03CR) 10Madhuvishy: sge: Fix global config handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351379 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy) [20:16:15] (03PS1) 10Urbanecm: Allow page move only autopatrolled at hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351382 (https://phabricator.wikimedia.org/T164239) [20:18:50] (03CR) 10Dzahn: "thanks Gnome :/. 
bug from 2014 open, seahorse (ssh-add) doesn't support it. had to be reminded of https://bugzilla.gnome.org/show_bug.cgi?" [puppet] - 10https://gerrit.wikimedia.org/r/351368 (owner: 10Dzahn) [20:19:05] 06Operations, 06Labs, 13Patch-For-Review: rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955#3229197 (10madhuvishy) a:03madhuvishy [20:20:04] (03CR) 10Dzahn: "gimme a break https://bugzilla.gnome.org/show_bug.cgi?id=723274#c8 "There is no bounty on https://www.bountysource.com/issues/22896549-ca" [puppet] - 10https://gerrit.wikimedia.org/r/351368 (owner: 10Dzahn) [20:22:05] (03PS1) 10Urbanecm: Allow new page patroll for autoconfirmed users on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351385 (https://phabricator.wikimedia.org/T164159) [20:22:08] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [20:22:55] (03CR) 10Dzahn: "hah! :)" [puppet] - 10https://gerrit.wikimedia.org/r/351368 (owner: 10Dzahn) [20:36:43] 06Operations, 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3229240 (10Bawolff) [20:40:28] (03PS4) 10Krinkle: Move contribution tracking config to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [20:40:34] (03PS5) 10Krinkle: Move contribution tracking config to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [20:40:59] (03PS6) 10Krinkle: Move contribution tracking config to CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [20:41:47] (03CR) 10Krinkle: "Rebased to resolve conflict with private/ update." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [20:42:00] (03Draft1) 10Paladox: Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 [20:42:03] (03PS2) 10Paladox: Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 [20:42:54] hashar ^^ [20:43:21] paladox: well the patch is not the issue :] [20:43:31] ok :) [20:43:44] paladox: the real challeng is figuring out what that nocanon is for and what kind of side effect it will have hehe [20:43:55] do you have that on your Jenkins instance ? [20:44:03] * paladox checks [20:44:37] hashar yep [20:44:38] ProxyPass / http://localhost:8082/ retry=0 nocanon [20:44:51] \o/ state so on the change so :] [20:45:00] note you have a trailing slash, and I am not sure the @prefix does [20:45:10] oh [20:45:15] yep [20:45:24] ProxyPass /ci http://localhost:8080/ci [20:45:26] we dont :( [20:45:36] (03PS3) 10Paladox: Jenkins: Add noncanon to jenkins proxy site [puppet] - 10https://gerrit.wikimedia.org/r/351391 [20:45:47] oh [20:45:58] i wonder if we change that to /ci/ would that break something [20:45:59] found this: Normally, mod_proxy will canonicalise ProxyPassed URLs. But this may be incompatible with some backends, particularly those that make use of PATH_INFO. The optional nocanon keyword suppresses this and passes the URL path "raw" to the backend. Note that this keyword may affect the security of your backend, as it removes the normal limited protection against URL-based attacks [20:46:05] provided by the proxy. [20:46:51] so that means using nocanon affects security [20:46:51] (03CR) 10Hashar: "It works for Paladox on labs. 
Note @prefix is '/ci' and the doc mentions it should have a trailing slash but we have:" [puppet] - 10https://gerrit.wikimedia.org/r/351391 (owner: 10Paladox) [20:48:33] and there is an AllowEncodedSlashes NoDecode [20:48:40] paladox: yea [20:48:54] i wonder why they recommend using that. [20:49:00] and the trailing slash thing has been an issue before [20:49:16] at least in other CI virtual hosts afair [20:49:56] yup it is a bit messy [20:50:18] hashar i can test https://phabricator.wikimedia.org/T155840#3229233 without the noncanon :) [20:52:02] hashar still works without needing nocanon http://gerrit-jenkins.wmflabs.org/blue/pipelines [20:52:03] :) [20:52:03] anyone from releng can help me verify an error on a deployed wiki? [20:55:00] it is not that important, I will file a bug [20:56:26] jynus: sorry I am half asleep. Maybe twentyafterfour / thcipriani can help digging in log / config? [20:56:37] paladox: I might try nocanon tomorrow [20:56:38] * twentyafterfour reads up [20:56:40] yes, of course, go to sleep [20:56:46] ok :) [20:56:57] jynus: how can I help? [20:57:04] twentyafterfour, it is probably a very trivial question [20:57:16] but I tried to decipher our config files [20:57:31] and cannot even know whwer to start to look [20:57:35] heh, our config files are not very trivial ;) [20:57:40] there is a table, math [20:57:46] on all s3 wikis except one [20:57:51] hmm [20:57:56] sounds strange [20:58:11] I know there are some disabled extensions depending on the wiki [20:58:21] but all 899 but one sounds strange :-) [20:58:31] yeah ... which one is it missing from? [20:58:39] one sec, I lost it [20:58:49] I checked the part I know and it is not a closed or delted wiki [20:58:59] I'd say create the table on the one where it's missing? an extraneous table won't cause any problems right? [20:59:05] wait [20:59:47] ERROR 1146 (42S02) at line 8: Table 'bdwikimedia.math' doesn't exist [21:00:54] that's got to be a real error, e.g. 
the table should be there, right? Otherwise nothing would be trying to use it [21:01:04] it is an active ,small s3 wikimedia wiki [21:01:20] I was trying to see if the table is optional [21:01:35] because maybe it is ok [21:01:57] that is the part that maybe you could give me some hints "enabled extensions" [21:02:12] per wiki, some easy way to get that? [21:03:00] lets see... [21:03:11] it is not part of core, at least not the current core [21:03:22] so math or mathoid [21:03:24] maybe? [21:03:44] yeah probably the math extension [21:03:51] (03PS1) 10Framawiki: Enable wgCiteResponsiveReferences on ilowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351459 [21:03:55] * twentyafterfour checks [21:04:15] correct: https://phabricator.wikimedia.org/diffusion/EMAT/browse/master/db/math.mysql.sql [21:04:38] yeah it's the math extension [21:04:40] I would suppose that is enabled everwhere [21:04:51] is there a canonical place to confirm that? [21:04:59] please teach me once [21:05:03] and I will not ask again [21:05:04] I think so, which extension is enabled is configured via the UI isn't it? rather than by code? [21:05:06] :-) [21:05:09] jynus: That wiki got created recently. Sounds like soeone forgot. [21:05:12] I'm actually not sure [21:05:16] oh [21:05:25] if it is new, that could be it [21:05:25] * James_F checks. 
[21:05:40] let me double check [21:05:52] maybe I got the error during creation and it is now there [21:05:56] InitializeSettings.php seems to be the place it's configured...hmm [21:05:57] (03PS2) 10Framawiki: Enable wgCiteResponsiveReferences on ilowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351459 (https://phabricator.wikimedia.org/T164230) [21:05:59] 5020c5c4145524c377ec80c2be1d72dbe5b494c6 [21:06:06] that would be embarrasing [21:06:07] 2017-04-08 [21:06:15] 'wmgUseMath' => [ [21:06:17] 'default' => true, // moved from MW core [21:06:19] 'loginwiki' => false, [21:06:21] 'votewiki' => false, // T61702 [21:06:21] T61702: Examine which extensions are installed on login.wikimedia.org (loginwiki) and vote.wikimedia.org (votewiki) - https://phabricator.wikimedia.org/T61702 [21:06:23] ], [21:06:33] ok, how did you find that? [21:06:40] what is the secret? [21:06:43] :-) [21:06:53] jynus: In operations/mediawiki-config, I did `git log dblists/all.dblist`. :-) [21:07:00] no no [21:07:02] sorry [21:07:07] I meant the config [21:07:20] Oh, twentyafterfour's bit? That's in operations/mediawiki-config, in wmf-config/CommonSettings.php [21:07:28] Is wmgUseX standard? [21:07:40] Yes, ish. That repo needs a good clean out. [21:07:49] this is completely unrelated [21:08:02] jynus: I knew it was either InitializeSettings or CommonSettings [21:08:10] but the other day I was comparing, and our mediawiki config is larger than a large portion of wikis [21:08:16] https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php has 144 references to things starting with "wmguse" [21:08:24] jynus: yeah the config is crazy mess [21:08:32] so maybe the most mecanical parts can be moved to a structured store? [21:08:37] The long-term plan is to move most of it into static… yes. [21:09:05] sorry, what does static mean? [21:09:25] static as apposed to dynamic ... 
the parts that aren't computed [21:09:29] er [21:09:41] but it is configuration [21:09:45] I'm actually not sure what James meant specifically [21:09:51] We've been slowly moving default extension code ("extension registration") into static files (extension.json). [21:09:56] I think it's going into etcd, right? [21:09:59] ah, I got you [21:10:00] now [21:10:14] Once that's fully done, the next step is to start moving non-default (WMF-specific) config into static files too. [21:10:18] note that etcd is for state [21:10:22] not configuration [21:10:34] Alongside that there's the idea of moving into a config DB to do things more efficiently. [21:10:40] state == is X read only, what dbs are pooled, etc. [21:10:48] right [21:10:48] But at this point I step well back and let RelEng and Ops decide. :-) [21:11:01] James_F, yes, that is what I was initially thinking [21:11:09] for the most structured parts [21:11:25] a query to the db that can be cached for X seconds [21:11:40] or any other store that is less crazy to check and modify :-) [21:11:45] Of course, if your config is dynamic (e.g. "only enable this extension if the wiki has not been edited in the past three weeks" or whatever crazy config someone wants) you'll have issues. [21:11:48] thank you all [21:11:59] so conftool is badly named? it's not actually for config? :D [21:12:11] twentyafterfour: "Hard things in Computer Science". ;-) [21:12:12] well, it can be [21:12:24] James_F: indeed [21:12:25] hashar i now get this warning on the manage page in jenkins [21:12:27] "It appears that your reverse proxy set up is broken." [21:12:34] i meant it for ops - we use puppet for config (ops level), etcd for state [21:12:47] I see [21:13:22] In our mind, you do not want to move everything to dynamic configuration [21:13:36] because deploying + CR has advantages [21:13:47] for careful review, etc.
[21:13:51] right [21:14:11] but I literally do a db-eqiad commit every day [21:14:31] and it cannot be changed automatically by a bot under X conditions [21:14:48] that is the advantage of that other model [21:14:53] Yeah, it'd be good to save you that pain. [21:14:58] it doesn't substitute [21:15:02] just complement [21:15:13] And as I'm subscribed to that repo, I wouldn't mind losing the 'spam'. ;-) [21:15:17] for specific cases, less or no deploy errs, etc. [21:15:28] James_F, I cannot agree more :-) [21:15:57] also imagine: I have to shut down a database, I just run ./depool, and that is all [21:16:07] we can even hook it to disable notifications [21:16:19] The future sounds lovely. [21:16:22] that is like heaven to me [21:16:36] it is a very near future, actually [21:16:48] :) [21:16:59] Tim and Krinkle made an effort and it will be used tomorrow for the failback [21:17:03] for a very small subset [21:17:04] (CR) Paladox: "Causes a error message to go on the manage page if nocanon is not set" [puppet] - https://gerrit.wikimedia.org/r/351391 (owner: Paladox) [21:18:04] Yeah. For DB active/master/rotation I guess there's a bit more work to do? [21:18:31] well, that is a different problem - architecture rather than configuration [21:18:56] and actually it is very easy to set up - the problem is mediawiki support of that [21:19:06] it is not transparent - only one master is assumed, etc. [21:19:08] * James_F nods. [21:19:21] but that is like active-active dc [21:19:35] we will have to pay the price that we didn't the last 15 years :-) [21:19:42] Indeed.
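Editorial note: the `wmgUseMath` block pasted at 21:06 follows a "default plus per-wiki override" shape, which is why a wiki not listed there (such as bdwikimedia) is expected to have the Math extension and its table. A minimal Python sketch of that lookup, under the assumption that an exact wiki name wins over `default`; the real PHP config has more to it than this, and `setting_for` is a hypothetical name.

```python
# Values copied from the wmgUseMath snippet pasted in the channel.
SETTINGS = {
    "wmgUseMath": {
        "default": True,
        "loginwiki": False,
        "votewiki": False,
    },
}


def setting_for(name, wiki, settings=SETTINGS):
    """Return the per-wiki override if one exists, otherwise the default."""
    per_wiki = settings[name]
    return per_wiki.get(wiki, per_wiki["default"])


if __name__ == "__main__":
    for wiki in ("bdwikimedia", "loginwiki", "votewiki"):
        print(wiki, setting_for("wmgUseMath", wiki))
```

The same shape is what makes the "structured store" idea from the discussion plausible for the mechanical parts of the config: a lookup like this needs no code execution, only data.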
[21:20:02] so actually the tables isn't yet there [21:20:27] I can create it, but I will instead create a a ticket and CC Derecks*n [21:20:46] and I can still create it later if he is cool with that [21:22:24] maybe it wasn't him, I will check the log [21:22:52] (03CR) 10Hashar: [C: 04-1] "Pending Jenkins 2.46.2" [puppet] - 10https://gerrit.wikimedia.org/r/351261 (owner: 10Hashar) [21:25:21] strange, the wiki request is from 2011, maybe the list was fixed recently, but the wiki existed for a long time? [21:26:05] yes, this is a very old bug [21:26:57] renaming wikis? wow [21:27:03] what? [21:27:25] sorry, i got confused. ignore that :) [21:27:35] just because you said "very old bug" [21:27:43] and wiki request [21:27:45] ha [21:27:52] that is a different old bug [21:27:57] ok , yep :) [21:28:23] it seems to me that Bangladesh user groupd doesn't use the math extension much [21:30:35] confirmed it doesn't exist anywhere, so creating on the master [21:31:43] (03PS1) 10Chad: Gerrit: Finish replication prep [puppet] - 10https://gerrit.wikimedia.org/r/351520 [21:33:11] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Finish replication prep [puppet] - 10https://gerrit.wikimedia.org/r/351520 (owner: 10Chad) [21:33:20] !log creating missing math table on bdwikimedia (s3) [21:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:58] (03PS2) 10Chad: Gerrit: Finish replication prep [puppet] - 10https://gerrit.wikimedia.org/r/351520 [21:44:10] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6271/" [puppet] - 10https://gerrit.wikimedia.org/r/351520 (owner: 10Chad) [21:47:37] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T152525" [puppet] - 10https://gerrit.wikimedia.org/r/351520 (owner: 10Chad) [21:48:35] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3229484 (10Dzahn) Gerrit: **Finish replication prep** - 
https://gerrit.wikimedia.org/r/#/c/351520/ has been deployed. [21:52:09] !log running previously failed alter tables on s3-eqiad T163912 [21:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:18] T163912: Convert unique keys into primary keys for some wiki tables on s3-eqiad - https://phabricator.wikimedia.org/T163912 [21:52:21] 06Operations, 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3229493 (10Dzahn) >>! In T163721#3227925, @Andrew wrote: > We haven't designated a specific person in charge, but maybe this ca... [21:52:56] (03PS1) 10Chad: Gerrit: Go ahead and apply gerrit role to new slave in codfw [puppet] - 10https://gerrit.wikimedia.org/r/351525 [21:54:15] mutante: Running compiler ^ [21:55:35] (03CR) 10Chad: "Compiled. No changes on cobalt, applies all roles/files to gerrit2001." [puppet] - 10https://gerrit.wikimedia.org/r/351525 (owner: 10Chad) [21:56:30] ok, cool [21:57:29] (03PS2) 10Dzahn: Gerrit: Go ahead and apply gerrit role to new slave in codfw [puppet] - 10https://gerrit.wikimedia.org/r/351525 (owner: 10Chad) [21:57:45] (03CR) 10Dzahn: [C: 032] Gerrit: Go ahead and apply gerrit role to new slave in codfw [puppet] - 10https://gerrit.wikimedia.org/r/351525 (owner: 10Chad) [21:59:13] RainbowSprinkles: merged on master, you want the puppet run to watch? [21:59:16] Yeah [21:59:20] go ahead [22:00:44] Most things applied [22:00:48] Inspecting failures [22:01:36] Bacula isn't happy [22:01:37] Hmm [22:02:18] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:02:22] eh, that's unusual, what does it say [22:02:31] i can check bconsole and stuff [22:02:34] Notice: /Stage[main]/Bacula::Client/Base::Expose_puppet_certs[/etc/bacula]/Exec[create-/etc/bacula-keypair]/returns: sh: 1: cannot create /etc/bacula/ssl/server-keypair.pem: Directory nonexistent [22:02:39] Then a bunch of cascading failures [22:02:45] uhm.. new to me [22:02:47] But I shouldn't be installing the client on the slave [22:02:55] if $bacula != undef and !$slave { [22:03:18] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[create-/etc/bacula-keypair],Service[bacula-fd],Service[gerrit] [22:03:32] Yes yes [22:03:33] I know [22:03:37] that bacula issue sounds related to our attempts to reuse puppet certs for everything [22:03:44] vs having multiple certs [22:04:02] oh,.. that [22:04:28] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 640630 [22:05:45] "Directory nonexistent" means it is trying to create a file in a dir that hasn't been created (puppet doesn't autocreate parent dirs) [22:05:51] i had a problem with it being if $bacula != undef and !$slave { when the class was converted to profile [22:06:15] jynus: Yeah, but the weird part is I'm provisioning a slave, it shouldn't be trying to install bacula [22:07:28] Ahhhh [22:07:32] Running it two more times worked [22:07:38] Must be a dependency issue in bacula somewhere [22:08:09] i suspect there was a more recent change to the way it reuses puppet certs or we would have noticed that more [22:08:17] but good [22:09:20] mutante: Can you ack those ^ [22:09:26] It'll be a bit before they're 100% [22:10:09] maybe https://gerrit.wikimedia.org/r/#/c/344606/ [22:10:19] sure, yea [22:11:07] ACKNOWLEDGEMENT - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn setup ongoing [22:11:07] ACKNOWLEDGEMENT - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[gerrit] daniel_zahn setup ongoing [22:14:15] (03CR) 10Dzahn: "today while Chad worked on https://phabricator.wikimedia.org/T152525 he noticed this:" [puppet] - 10https://gerrit.wikimedia.org/r/344606 (https://phabricator.wikimedia.org/T161281) (owner: 10Alexandros Kosiaris) [22:15:40] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 608.24 seconds [22:15:40] PROBLEM - MariaDB Slave Lag: s1 on db1066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.54 seconds [22:15:40] PROBLEM - MariaDB Slave Lag: s1 on db1073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 609.57 seconds [22:15:40] PROBLEM - SSH access on gerrit2001 is CRITICAL: connect to address 208.80.153.106 and port 29418: Connection refused [22:15:40] PROBLEM - MariaDB Slave Lag: s1 on db1055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.40 seconds [22:15:41] PROBLEM - MariaDB Slave Lag: s1 on db1067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.41 seconds [22:15:41] PROBLEM - MariaDB Slave Lag: s1 on db1051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 610.49 seconds [22:15:42] PROBLEM - MariaDB Slave Lag: s1 on db1080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.01 seconds [22:15:44] PROBLEM - MariaDB Slave Lag: s1 on db1083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.01 seconds [22:15:50] PROBLEM - MariaDB Slave Lag: s1 on db1065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.42 seconds [22:15:57] mmm, that is new [22:16:06] i was about to ask if you wre aware of that, heh [22:16:10] PROBLEM - MariaDB Slave Lag: s1 on db1089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.47 seconds [22:16:10] PROBLEM - MariaDB Slave Lag: s1 on db1072 is CRITICAL: CRITICAL 
slave_sql_lag Replication lag: 642.34 seconds [22:16:10] PROBLEM - MariaDB Slave Lag: s1 on db1052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.51 seconds [22:16:20] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:16:22] there is lag on the eqiad master [22:17:12] interesting [22:17:23] totally unexpected, but it is not creating user issues [22:17:44] (those things are risky, that is why we are doing it while it is depooled) [22:17:54] (03CR) 10Chad: [C: 04-1] Scap prep: Save network time by copying data locally (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351356 (owner: 10Chad) [22:18:11] It is "alter table user_properties drop key user_properties_user_property" [22:18:30] robh, if you see that starting tomorrow, you would be right to get scared [22:18:33] :-) [22:19:22] I am not going to do anything, it should fix itself in 30 minutes [22:20:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 873.26 seconds [22:20:04] but again, if this happens for eqiad in the future, do not hesitate to call me [22:21:01] strange because I did the exact same thing for all other shards, and only s1 gave this error [22:21:25] oh, I see why [22:21:37] metadata lock creating a race condition [22:21:40] RECOVERY - SSH access on gerrit2001 is OK: SSH OK - GerritCodeReview_2.13.4-13-gc0c5cc4742 (SSHD-CORE-1.2.0) (protocol 2.0) [22:23:18] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3229622 (10Dzahn) We need to allow SSH between both servers for clustering, just like for Phabricator in T137928#2565556. [https://gerrit.wikimedia.org/r/#/c/... 
[22:24:20] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [22:24:20] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [22:24:20] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:24:59] <_joe_> win 19 [22:31:37] (03PS1) 10Dzahn: gerrit: ferm rules to allow ssh between servers for clustering [puppet] - 10https://gerrit.wikimedia.org/r/351533 (https://phabricator.wikimedia.org/T152525) [22:33:53] (03PS2) 10Dzahn: gerrit: ferm rules to allow ssh between servers for clustering [puppet] - 10https://gerrit.wikimedia.org/r/351533 (https://phabricator.wikimedia.org/T152525) [22:34:23] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6273/" [puppet] - 10https://gerrit.wikimedia.org/r/351533 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:35:23] (03PS1) 10Jdlrobson: Enable related pages for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351534 (https://phabricator.wikimedia.org/T155079) [22:36:42] 51% progress... [22:38:26] !log gerrit (cobalt/gerrit2001) - deployed firewall change to allow ssh between gerrit servers for clustering, new iptables rules exist now (T152525) [22:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:34] T152525: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525 [22:39:19] (03CR) 10Dzahn: "cobalt/gerrit2001 each:" [puppet] - 10https://gerrit.wikimedia.org/r/351533 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:39:23] Hmmm [22:39:26] Can't seem to ssh [22:39:27] times out [22:39:37] RainbowSprinkles: so i kind of expected this.. 
needs 2 steps [22:39:44] since i found the ticket when we did the same for phab [22:39:56] after ferm now needs #netops for ACL [22:40:14] it would be like https://phabricator.wikimedia.org/T143363 [22:40:58] https://phabricator.wikimedia.org/T152525#3229622 [22:41:43] no wait.. still looking [22:42:19] Oh, there it goes [22:42:24] It's asking for a password now, which is wrong [22:42:27] But closer [22:42:38] pheeww.. i like to hear that [22:42:53] that old ticket where we got the same thing done for phab was such a rabbit hole [22:43:03] where multiple things were wrong [22:43:10] but then ok :) [22:43:24] and forget everything about "ACL" i said [22:43:29] It says id_rsa doesn't exist [22:43:30] Lies [22:44:16] something is not ok, there is lag on all s1 eqiad hosts, but not on labsdbs [22:44:39] It's also farrrr too slow. Something's not right... [22:45:46] RainbowSprinkles: try ssh -4 [22:46:17] way faster, right [22:46:26] Much faster yeah [22:46:36] Ok, good enough for me. Still, why can't it find that id_rsa file? [22:46:58] jynus: between 10:01 and 10:11 db1052 bytes sent went to almost zero [22:47:29] RainbowSprinkles: can you paste the whole thing [22:47:34] Yeah [22:48:03] also, i can fix v6 [22:48:25] https://phabricator.wikimedia.org/P5361 [22:48:32] volans, yes, I am expecting that [22:48:34] (identical trying to go the other direction) [22:48:40] running alter? 
[22:48:42] what I do not expect is labs being up to date [22:48:54] it was supposed to be online, but yes [22:49:03] sorry, didn't read backlog [22:49:16] s1 master is blocked [22:49:23] that is expected [22:49:35] not planned, but normal [22:49:46] but labsdb1001 is saying it has no lag [22:50:03] https://tools.wmflabs.org/replag/ [22:50:13] I can see it flapping on tendril tree [22:50:24] that is heartbeat [22:50:35] it is not the local lag, it compares to the clock [22:51:37] something is writing "2017-05-02T22:51:10.001390" to the heartbeat table [22:54:51] (03PS1) 10Volans: wmf-config: readonly is set in etcd now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351539 (https://phabricator.wikimedia.org/T156924) [22:55:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 290.05 seconds [22:55:27] (03CR) 10Volans: [C: 04-2] "I74b332c6c4af72085bd21479009ff2f2dadbb9eb needs to go first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351539 (https://phabricator.wikimedia.org/T156924) (owner: 10Volans) [22:56:07] more strange recovery order [22:56:10] RECOVERY - MariaDB Slave Lag: s1 on db1089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:20] RECOVERY - MariaDB Slave Lag: s1 on db1072 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:20] RECOVERY - MariaDB Slave Lag: s1 on db1052 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:40] RECOVERY - MariaDB Slave Lag: s1 on db1066 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:40] RECOVERY - MariaDB Slave Lag: s1 on db1051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:40] RECOVERY - MariaDB Slave Lag: s1 on db1067 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:40] RECOVERY - MariaDB Slave Lag: s1 on db1055 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:40] RECOVERY - MariaDB Slave Lag: s1 on db1073 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:56:41] RECOVERY - 
MariaDB Slave Lag: s1 on db1080 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [22:56:41] RECOVERY - MariaDB Slave Lag: s1 on db1083 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [22:56:49] RainbowSprinkles: the client side is like "no such file or directory" but the server side "failed public key". still debugging [22:56:50] RECOVERY - MariaDB Slave Lag: s1 on db1065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:57:10] <_joe_> !log upgrading python-conftool across the fleet [22:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:40] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [23:00:42] !log locking scap on naos for deployment of EtcdConfig https://gerrit.wikimedia.org/r/#/c/351132/ [23:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:22] (03PS3) 10Tim Starling: Enable EtcdConfig in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351132 (https://phabricator.wikimedia.org/T156924) [23:02:42] (03CR) 10Tim Starling: [C: 032] Enable EtcdConfig in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351132 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [23:03:49] (03Merged) 10jenkins-bot: Enable EtcdConfig in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351132 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [23:05:49] (03CR) 10jenkins-bot: Enable EtcdConfig in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351132 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [23:05:54] hmm [23:05:58] i am now getting this error [23:05:59] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item gerrit::servers in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/gerrit/server.pp:50 on node gerrit-test3.git.eqiad.wmflabs 
[23:05:59] Warning: Not using cache on failed catalog [23:05:59] Error: Could not retrieve catalog; skipping run [23:06:03] mutante ^^ [23:06:18] it was working fine up until now [23:07:46] !log scap pull on mw2017 and mwdebug1001 for etcd testing [23:07:48] 06Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3229950 (10RobH) [23:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:58] paladox: ah, thank you. could you do me a favor and add that [23:08:14] (03PS1) 10Chad: Gerrit: Fix SSH key authorization [puppet] - 10https://gerrit.wikimedia.org/r/351543 [23:08:17] oh, what do i put in it? [23:08:27] paladox: it's because i did https://gerrit.wikimedia.org/r/#/c/351533/2/hieradata/role/common/gerrit/server.yaml [23:08:28] ah [23:08:30] gerrit::servers: [23:08:30] 14 - cobalt.wikimedia.org [23:08:30] 15 - gerrit2001.wikimedia.org [23:08:37] paladox: in Labs hiera, but with the labs server names [23:08:43] yep [23:09:07] paladox: it allows them to talk ssh to each other for gerrit clustering support [23:09:24] oh [23:10:21] works now, thanks :) [23:11:42] (03CR) 10Dzahn: [C: 032] Gerrit: Fix SSH key authorization [puppet] - 10https://gerrit.wikimedia.org/r/351543 (owner: 10Chad) [23:11:52] paladox: thank you too, i should have thought about it [23:11:59] :) [23:12:08] we were still debugging that [23:13:22] (03CR) 10Volans: wmf-config: readonly is set in etcd now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351539 (https://phabricator.wikimedia.org/T156924) (owner: 10Volans) [23:14:03] (03PS2) 10Tim Starling: wmf-config: readonly is set in etcd now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351539 (https://phabricator.wikimedia.org/T156924) (owner: 10Volans) [23:14:08] (03CR) 10Tim Starling: [C: 032] wmf-config: readonly is set in etcd now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351539 (https://phabricator.wikimedia.org/T156924) (owner: 
10Volans) [23:14:25] mutante it's now causing the icinga checks to fail for me. [23:14:30] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 0 [23:14:40] paladox: what check exactly [23:14:49] puppet, check user, apt [23:14:58] PROBLEM - check users on gerrit.git.wmflabs.org is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:15:02] i don't see how that would be related [23:15:06] so far [23:15:26] it may have erased the port i opened (5666) [23:15:40] how did you open it [23:15:40] (03Merged) 10jenkins-bot: wmf-config: readonly is set in etcd now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351539 (https://phabricator.wikimedia.org/T156924) (owner: 10Volans) [23:15:49] (03CR) 10jenkins-bot: wmf-config: readonly is set in etcd now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351539 (https://phabricator.wikimedia.org/T156924) (owner: 10Volans) [23:15:58] i used /sbin/iptables -A INPUT -p tcp -d 0/0 -s 0/0 --dport 5666 -j ACCEPT [23:16:27] recovery when i run that command [23:16:38] paladox: ok, manual iptables changes are going to be removed by puppet when ferm rules are added and the service restarts [23:16:40] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:47] oh [23:16:50] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:57] paladox: you shouldn't do it manually but with security groups in horizon or something [23:17:00] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:05] _joe_: there you go ^^^ :D [23:17:23] Oh, i've done that but it doesn't seem to work [23:17:47] !log tstarling@puppetmaster1001 conftool action : set/@read-only.yaml; selector: name=ReadOnly,scope=eqiad [23:17:48] paladox: we'll check more later, still also debugging the prod thing [23:17:50] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 619 bytes in 9.489 second response time [23:17:50] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 620 bytes in 2.466 second response time [23:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:57] ok [23:18:01] <_joe_> volans: I removed the -j DROP now [23:18:17] I will manually edit puppet master to add the port [23:18:40] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 73584 bytes in 0.282 second response time [23:20:01] <_joe_> I just moved to use REJECT [23:21:11] (03Draft1) 10Paladox: DO NOT MERGE [puppet] - 10https://gerrit.wikimedia.org/r/351546 [23:21:12] (03PS2) 10Paladox: DO NOT MERGE [puppet] - 10https://gerrit.wikimedia.org/r/351546 [23:21:13] 06Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3229989 (10jcrespo) I am testing mydumper (and myloader) on db1015. That is one of our slowest hosts, although s3 should be favorable for an efficient dump. Taking a full s3 ba... [23:22:01] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:51] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.231 second response time [23:24:42] (03PS1) 10Dzahn: gerrit: also allow ssh via IPv6 between servers [puppet] - 10https://gerrit.wikimedia.org/r/351547 (https://phabricator.wikimedia.org/T152525) [23:25:16] (03PS2) 10Dzahn: gerrit: also allow ssh via IPv6 between servers [puppet] - 10https://gerrit.wikimedia.org/r/351547 (https://phabricator.wikimedia.org/T152525) [23:25:37] (03CR) 10Paladox: "will this break on labs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/351547 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [23:26:42] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:50] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:27:00] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:27:26] (03CR) 10Dzahn: [C: 032] "no IPv6 in labs unfortunately https://phabricator.wikimedia.org/T37947 but it should not break it either" [puppet] - 10https://gerrit.wikimedia.org/r/351547 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [23:30:30] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 73545 bytes in 0.289 second response time [23:30:40] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.148 second response time [23:30:50] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.230 second response time [23:30:55] (03PS1) 10Chad: Add at least a baseline scap.cfg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351548 [23:33:29] (03PS1) 10Tim Starling: Revert "Enable EtcdConfig in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351550 [23:33:36] (03PS1) 10Tim Starling: Revert "wmf-config: readonly is set in etcd now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351551 [23:34:13] (03CR) 10Tim Starling: [C: 032] "The configured timeout is not working, and the APC lock is apparently not working either." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/351550 (owner: 10Tim Starling) [23:34:21] (03CR) 10Tim Starling: [C: 032] Revert "wmf-config: readonly is set in etcd now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351551 (owner: 10Tim Starling) [23:35:11] (03Merged) 10jenkins-bot: Revert "Enable EtcdConfig in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351550 (owner: 10Tim Starling) [23:35:42] (03CR) 10jenkins-bot: Revert "Enable EtcdConfig in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351550 (owner: 10Tim Starling) [23:36:23] (03Merged) 10jenkins-bot: Revert "wmf-config: readonly is set in etcd now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351551 (owner: 10Tim Starling) [23:37:37] (03CR) 10jenkins-bot: Revert "wmf-config: readonly is set in etcd now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351551 (owner: 10Tim Starling) [23:42:17] !log EtcdConfig changes all reverted [23:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:41] (03PS1) 10Volans: Deployment server: do not touch lock file on master [puppet] - 10https://gerrit.wikimedia.org/r/351555 [23:50:24] (03CR) 10Giuseppe Lavagetto: [C: 031] Deployment server: do not touch lock file on master [puppet] - 10https://gerrit.wikimedia.org/r/351555 (owner: 10Volans) [23:51:01] (03CR) 10jerkins-bot: [V: 04-1] Deployment server: do not touch lock file on master [puppet] - 10https://gerrit.wikimedia.org/r/351555 (owner: 10Volans) [23:51:24] jenkins, is late, give me a break :D [23:51:41] lol pythonism [23:52:06] (03PS2) 10Volans: Deployment server: do not touch lock file on master [puppet] - 10https://gerrit.wikimedia.org/r/351555 [23:54:50] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3230090 (10Dzahn) after some debug: We need to move the ssh public key from gerrit2's home dir to 
/etc/ssh/userkeys/ to make ssh work for replication, we shou... [23:56:34] (03CR) 10Volans: [C: 032] Deployment server: do not touch lock file on master [puppet] - 10https://gerrit.wikimedia.org/r/351555 (owner: 10Volans) [23:56:48] (03CR) 10Volans: [C: 032] "Puppet compiler is sane: https://puppet-compiler.wmflabs.org/6274/" [puppet] - 10https://gerrit.wikimedia.org/r/351555 (owner: 10Volans)