[00:39:57] (03PS2) 10Andrew Bogott: Switch primary designate host to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254489 (https://phabricator.wikimedia.org/T106303) [00:39:58] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail [00:39:59] (03PS1) 10Andrew Bogott: Don't hardcode the local host IP in designate.conf. [puppet] - 10https://gerrit.wikimedia.org/r/254786 [00:41:47] (03PS5) 10Andrew Bogott: Rename holmium to labservices1002. [puppet] - 10https://gerrit.wikimedia.org/r/254465 [00:42:54] (03Abandoned) 10Andrew Bogott: Rename holmium to labservices1002. [dns] - 10https://gerrit.wikimedia.org/r/254463 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [01:00:19] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [01:01:19] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [01:01:29] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [01:04:40] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [01:04:59] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail [01:05:06] (03PS1) 10Andrew Bogott: Rename the labsdns role to labs::dns [puppet] - 10https://gerrit.wikimedia.org/r/254789 [01:06:22] (03PS2) 10Andrew Bogott: Rename the labsdns role to role::labs::dns [puppet] - 10https://gerrit.wikimedia.org/r/254789 [01:07:50] (03CR) 10Andrew Bogott: [C: 032] Rename the labsdns role to role::labs::dns [puppet] - 10https://gerrit.wikimedia.org/r/254789 (owner: 10Andrew Bogott) [01:08:18] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:09:39] (03PS2) 10Andrew Bogott: Don't hardcode the local host IP in designate.conf. [puppet] - 10https://gerrit.wikimedia.org/r/254786 [01:09:47] (03PS3) 10Andrew Bogott: Switch primary designate host to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254489 (https://phabricator.wikimedia.org/T106303) [01:10:19] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:10:38] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:14:01] (03CR) 10Andrew Bogott: [C: 032] Don't hardcode the local host IP in designate.conf. 
[puppet] - 10https://gerrit.wikimedia.org/r/254786 (owner: 10Andrew Bogott) [01:17:23] (03PS4) 10Andrew Bogott: Switch primary designate host to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/254489 (https://phabricator.wikimedia.org/T106303) [01:17:25] (03PS6) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [01:17:27] (03PS1) 10Andrew Bogott: Remove hard-coded host ip for pool_nameserver, take two [puppet] - 10https://gerrit.wikimedia.org/r/254791 [01:18:09] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [01:18:14] (03CR) 10jenkins-bot: [V: 04-1] Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [01:19:08] (03CR) 10Andrew Bogott: [C: 032] Remove hard-coded host ip for pool_nameserver, take two [puppet] - 10https://gerrit.wikimedia.org/r/254791 (owner: 10Andrew Bogott) [01:25:29] (03CR) 10Andrew Bogott: [C: 032] Switch primary designate host to labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/254489 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [01:26:59] !log end enwiki test election from cli to avoid confusion with real election [01:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:32:33] (03PS7) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [01:32:35] (03PS1) 10Andrew Bogott: labs_designate_hostname: "labservices1001.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/254792 (https://phabricator.wikimedia.org/T106303) [01:33:22] (03CR) 10jenkins-bot: [V: 04-1] Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [01:36:35] (03CR) 10Andrew Bogott: [C: 032] labs_designate_hostname: "labservices1001.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/254792 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [01:36:59] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [01:44:29] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [02:19:39] !log l10nupdate@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 05m 57s) [02:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:39] PROBLEM - designate-central process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-central [02:29:50] PROBLEM - designate-mdns process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-mdns [02:29:59] PROBLEM - designate-pool-manager process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [02:30:38] PROBLEM - designate-sink process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-sink [02:31:18] PROBLEM - designate-api process on holmium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-api [02:37:10] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail [02:42:18] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is 
failed [02:45:49] PROBLEM - DPKG on labservices1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [02:47:58] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [02:48:28] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [03:20:30] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [03:21:19] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.88 ms [03:39:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [03:48:02] (03PS1) 10Andrew Bogott: Exchange labs-ns2 and labs-ns3 [dns] - 10https://gerrit.wikimedia.org/r/254800 [03:48:44] (03CR) 10Andrew Bogott: [C: 032] Exchange labs-ns2 and labs-ns3 [dns] - 10https://gerrit.wikimedia.org/r/254800 (owner: 10Andrew Bogott) [04:00:29] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [04:11:09] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [04:30:18] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [04:31:18] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [04:31:59] (03PS1) 10Andrew Bogott: Consolidate hiera designate config. [puppet] - 10https://gerrit.wikimedia.org/r/254804 [04:44:18] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [04:51:36] (03CR) 10Andrew Bogott: [C: 032] Consolidate hiera designate config. [puppet] - 10https://gerrit.wikimedia.org/r/254804 (owner: 10Andrew Bogott) [04:53:10] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [04:56:08] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:01:33] (03PS1) 10Andrew Bogott: Labs dns: Make holmium use the secondary settings. [puppet] - 10https://gerrit.wikimedia.org/r/254806 [05:02:10] (03CR) 10Andrew Bogott: [C: 032] Labs dns: Make holmium use the secondary settings. [puppet] - 10https://gerrit.wikimedia.org/r/254806 (owner: 10Andrew Bogott) [05:06:26] (03PS1) 10Andrew Bogott: role::labs::nova::master -- replace $certificate definition. [puppet] - 10https://gerrit.wikimedia.org/r/254807 [05:08:35] (03CR) 10Andrew Bogott: [C: 032] role::labs::nova::master -- replace $certificate definition. 
[puppet] - 10https://gerrit.wikimedia.org/r/254807 (owner: 10Andrew Bogott) [05:09:48] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [05:10:59] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [05:11:39] PROBLEM - Recursive DNS on 208.80.155.118 is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:17:49] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [05:18:31] (03PS1) 10Andrew Bogott: Consolidate holmium.yaml and holmium1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/254808 (https://phabricator.wikimedia.org/T106303) [05:20:52] (03CR) 10Andrew Bogott: [C: 032] Consolidate holmium.yaml and holmium1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/254808 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [05:27:29] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:27:42] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:31:19] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.084 seconds response time. www.wikipedia.org returns 208.80.154.224 [05:31:40] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 931517 bytes in 4.149 second response time [05:36:38] (03PS1) 10Andrew Bogott: Assign labs-recursor0 back to holmium and labs-recursor1 back to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254809 [05:39:15] (03CR) 10Andrew Bogott: [C: 032] Assign labs-recursor0 back to holmium and labs-recursor1 back to labservices1001. [puppet] - 10https://gerrit.wikimedia.org/r/254809 (owner: 10Andrew Bogott) [05:43:39] RECOVERY - Recursive DNS on 208.80.155.118 is OK: DNS OK: 0.215 seconds response time. www.wikipedia.org returns 208.80.154.224 [05:46:10] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 1 failures [05:49:17] (03CR) 10Santhosh: [C: 031] CX: Enable article-recommender-1 campaign as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254117 (https://phabricator.wikimedia.org/T118033) (owner: 10KartikMistry) [05:51:44] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1824353 (10Andrew) labservices1001 is now the primary host; it should be safe to reboot/reinstall/rename holmium at any time. 
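For context on the designate-* process alerts above (02:29–02:31): they come from a process-presence check that counts processes whose argument list matches a regex such as ^/usr/bin/python /usr/bin/designate-central, so once the primary designate role moves to labservices1001 a count of zero on holmium is expected rather than surprising. A minimal Python sketch of that style of check over /proc follows; the regexes are copied from the alerts above, but the actual icinga plugin and thresholds used in production are not shown in this log, so this is illustrative only.

```python
import os
import re
import sys

def count_matching_procs(pattern):
    """Count processes whose full command line matches `pattern` (a regex)."""
    rx = re.compile(pattern)
    count = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                # cmdline is NUL-separated; join the args with spaces.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace").strip()
        except IOError:  # process exited between listdir() and open()
            continue
        if rx.search(cmdline):
            count += 1
    return count

if __name__ == "__main__":
    # Regexes copied from the designate process alerts quoted above.
    services = ["central", "mdns", "pool-manager", "sink", "api"]
    status = 0
    for name in services:
        n = count_matching_procs(r"^/usr/bin/python /usr/bin/designate-" + name)
        print("designate-%s: %d process(es)" % (name, n))
        if n == 0:
            status = 2  # CRITICAL, mirroring the icinga output above
    sys.exit(status)
```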
[06:13:06] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:30:56] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:56] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:26] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:35] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:35] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:47] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:06] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:36] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:45] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:56] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures [06:52:56] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1824367 (10mmodell) @bd808: so, reading the manpage for rsync, it seems that you have to specify --numeric-ids for this to even matter? Otherwise rsync should be smart enough to rem... [06:56:06] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:56:15] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:17] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:56:56] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:17] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:57:25] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:36] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:46] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:46] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:26] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [07:04:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 216, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps DWDM]BR [07:06:05] PROBLEM - Router 
interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps DWDM]BR [07:09:45] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [07:10:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 218, down: 0, dormant: 0, excluded: 0, unused: 0 [07:19:06] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [07:39:19] (03CR) 10TheDJ: "Ppl are still waiting for this patch to be deployed. It was announced a long tie ago..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52) [08:05:05] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:12:25] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 47629 bytes in 2.819 second response time [08:12:58] !log restarted gitblit on antimony [08:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:07] (03PS2) 10Giuseppe Lavagetto: pybal: install monitoring [puppet] - 10https://gerrit.wikimedia.org/r/252245 (https://phabricator.wikimedia.org/T102394) [08:21:09] (03CR) 10Santhosh: [C: 04-1] "cxserver with servicerunner removes all parsoid references and use only restbase. So this patch is not needed" [puppet] - 10https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [08:30:35] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [08:47:35] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [08:57:09] mobrovac: I'll start decommissioning restbase1001 shortly FYI [08:57:19] kk godog [08:57:40] !log mathoid deploying ff03d56 [08:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:58:11] !log restbase1001 nodetool decommission [08:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:01:04] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1824512 (10fgiunchedi) see also {T118183} for followup on fully getting the new machines added to swift [09:18:06] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [09:40:12] (03CR) 10Alexandros Kosiaris: "@kartikMistry, abandon per santhosh's comment?" 
[puppet] - 10https://gerrit.wikimedia.org/r/254151 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [09:44:56] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:50:10] (03PS5) 10Alexandros Kosiaris: Add rubocop and 'test' target to Rakefile [puppet] - 10https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993) (owner: 10Zfilipin) [09:50:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 624 [09:54:44] (03CR) 10ArielGlenn: [C: 032 V: 032] start work on cleaning up 'chunks' handling [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252131 (owner: 10ArielGlenn) [09:55:15] RECOVERY - check_mysql on db1008 is OK: Uptime: 833442 Threads: 2 Questions: 13471166 Slow queries: 5586 Opens: 25022 Flush tables: 2 Open tables: 64 Queries per second avg: 16.163 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:56:03] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: clean up construction of list of possible dump jobs for wiki [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252132 (owner: 10ArielGlenn) [09:57:26] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [10:01:23] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: clean up many comments of methods for dumps jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252133 (owner: 10ArielGlenn) [10:02:20] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: clean up docstrings for recompress jobs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/252134 (owner: 10ArielGlenn) [10:02:26] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [10:03:17] (03PS1) 10ArielGlenn: dumps: clean up history dumps code, dump only missing pages for reruns [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/254820 [10:10:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0] [10:17:46] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [10:24:37] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:30:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0] [10:35:54] (03CR) 10Alexandros Kosiaris: [C: 032] Add rubocop and 'test' target to Rakefile [puppet] - 10https://gerrit.wikimedia.org/r/252686 (https://phabricator.wikimedia.org/T117993) (owner: 10Zfilipin) [10:38:00] (03PS2) 10Alexandros Kosiaris: RuboCop: regenerated TODO file [puppet] - 10https://gerrit.wikimedia.org/r/254427 (owner: 10Zfilipin) [10:38:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] RuboCop: regenerated TODO file [puppet] - 10https://gerrit.wikimedia.org/r/254427 (owner: 10Zfilipin) [10:40:19] (03PS2) 10Alexandros Kosiaris: RuboCop: fixed Style/DeprecatedHashMethods offense [puppet] - 10https://gerrit.wikimedia.org/r/254428 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [10:42:38] (03CR) 10Alexandros Kosiaris: [C: 032] RuboCop: fixed Style/DeprecatedHashMethods offense [puppet] - 10https://gerrit.wikimedia.org/r/254428 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [11:05:35] <_joe_> !log stopping apache on mw2051 for testing [11:05:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:20] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#1824728 (10Joe) 3NEW [11:23:41] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#1824746 (10Joe) [11:24:28] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/253637 (https://phabricator.wikimedia.org/T118863) (owner: 10Ottomata) [11:33:21] of course that was true this morning, I just forgot to set it :-/ [11:35:19] (03PS7) 10Glaisher: noindex user namespace on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52) [11:36:56] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [11:37:37] (03PS1) 10Jhobs: End first mobile QuickSurveys campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254831 (https://phabricator.wikimedia.org/T118881) [11:48:17] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [12:00:37] (03PS1) 10Zfilipin: RuboCop: Fixed Lint/UnusedBlockArgument offense [puppet] - 10https://gerrit.wikimedia.org/r/254833 (https://phabricator.wikimedia.org/T112651) [12:08:09] (03PS1) 10ArielGlenn: dumps: fix up cleanup of old files from previous run [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/254836 [12:08:11] (03PS1) 10ArielGlenn: dumps: new option 'cleanup' to require cleanup when dump rerun [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/254837 [12:09:30] (03PS1) 10Zfilipin: RuboCop: fixed Lint/UnusedMethodArgument offense [puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) [12:10:09] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254833 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:10:18] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:13:18] (03CR) 10Alexandros Kosiaris: [C: 032] RuboCop: Fixed Lint/UnusedBlockArgument offense [puppet] - 10https://gerrit.wikimedia.org/r/254833 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:13:22] 6operations, 10Analytics, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1824823 (10jcrespo) 3NEW [12:14:18] (03PS1) 10Zfilipin: RuboCop: fixed Style/AlignParameters parameters [puppet] - 10https://gerrit.wikimedia.org/r/254839 (https://phabricator.wikimedia.org/T112651) [12:14:46] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254839 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:16:29] 6operations, 6Labs, 10Tool-Labs, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1824834 (10jcrespo) And this is why your DBA cannot take a single day of vacations. [12:17:45] 6operations, 6Labs, 10Tool-Labs, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1823353 (10jcrespo) Waiting for replication lag to go back to 0 to close this ticket. 
[12:18:10] (03PS1) 10Zfilipin: RuboCop: fixed Style/AndOr offense [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) [12:19:11] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:22:14] (03PS2) 10Zfilipin: RuboCop: fixed Style/AndOr offense [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) [12:22:25] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:23:26] (03CR) 10Zfilipin: "Patch set 2 increases Metrics/LineLength from 158 to 159, since this change made one line longer." [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:25:55] (03PS1) 10Cenarium: Remove proxyunbannable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254842 (https://phabricator.wikimedia.org/T75414) [12:26:07] PROBLEM - Disk space on wdqs1002 is CRITICAL: DISK CRITICAL - free space: /var/lib/wdqs 7113 MB (3% inode=99%) [12:29:29] (03PS2) 10Alexandros Kosiaris: RuboCop: fixed Style/AlignParameters parameters [puppet] - 10https://gerrit.wikimedia.org/r/254839 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:35:26] (03CR) 10Alexandros Kosiaris: [C: 032] RuboCop: fixed Style/AlignParameters parameters [puppet] - 10https://gerrit.wikimedia.org/r/254839 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [12:44:07] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1824867 (10jcrespo) > AFAIK there is no reason to not have an expiry time With our current setup, it takes 1 year to to a rolling restart of all our database boxes, and probably around 1... [12:49:23] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1824873 (10ArielGlenn) Just checking in to see if the vendors have come back to us yet. [12:49:45] 6operations, 6Labs, 10Tool-Labs, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1824876 (10Luke081515) @jcrespo Replag at dewiki and commons is still growing :-( [12:51:52] 6operations, 6Labs, 10Tool-Labs, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1824878 (10jcrespo) Sadly, after a server crashes, it empties its caches, so I would expect it to be growing for a while. I am monitoring it, just in case there is more issues- will kee... [12:59:13] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1824883 (10jcrespo) a:3jcrespo [13:01:17] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1824895 (10jcrespo) [13:05:30] 6operations, 10Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1824900 (10ArielGlenn) I did not see anything strange happening this month's run, which has now concluded. Can folks on th... 
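The replication checks quoted above (check_mysql on db1008 reporting Seconds Behind Master, and the labsdb lag jcrespo is waiting out after the crash) boil down to watching Seconds_Behind_Master from SHOW SLAVE STATUS until it returns to 0. A minimal sketch of that kind of ad-hoc lag watch follows, assuming PyMySQL is available; the host, user and password below are hypothetical placeholders, not production credentials, and this is not the check_mysql plugin itself.

```python
import time
import pymysql  # assumes PyMySQL is installed

def replication_lag(conn):
    """Return Seconds_Behind_Master from SHOW SLAVE STATUS, or None if unknown."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
    if not row:
        return None  # not configured as a replica
    # NULL (None) here means the SQL or IO thread is stopped, i.e. replication broken.
    return row["Seconds_Behind_Master"]

if __name__ == "__main__":
    # Placeholder connection details -- not the production credentials.
    conn = pymysql.connect(host="replica.example", user="watch", password="secret")
    while True:
        lag = replication_lag(conn)
        print("lag:", "unknown/broken" if lag is None else "%ds" % lag)
        if lag == 0:
            break  # caught up, as in the 09:55 recovery above
        time.sleep(30)
```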
[13:09:40] (03PS1) 10Joal: Update monitoring function using graphite [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) [13:10:39] (03CR) 10jenkins-bot: [V: 04-1] Update monitoring function using graphite [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [13:12:26] 6operations, 10Salt: salt still has issues with grain selection? - https://phabricator.wikimedia.org/T114937#1824916 (10ArielGlenn) The grain returner message is bogus and can be ignored. -v will only tell you about negative results for non-batch mode I believe. The timeout of 150 is quite long; what happens... [13:18:12] !log restarting mysql on labsdb1003, unresponsive [13:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:20] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1824925 (10ArielGlenn) Ping @Jdrewniak; please sign the L3 document mentioned in Andrew's comment above. Thanks! [13:19:44] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1824927 (10ArielGlenn) That approval is fine. I'll get a patch in today. [13:20:27] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [13:37:50] (03CR) 10Faidon Liambotis: [C: 04-1] "CentralNotice relies on that and is heavily used by fundraising. Let's not apply anything resembling it until after the fundraising is ove" [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [13:38:45] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1824949 (10ArielGlenn) Oldest file in these directories is now Aug 25 (today is Nov 23), so looking good. [13:40:11] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1824950 (10ArielGlenn) p:5Triage>3Normal [13:41:05] 6operations, 10fundraising-tech-ops, 5Patch-For-Review: remove fundraising banner log related cruft from production puppet - https://phabricator.wikimedia.org/T118325#1824953 (10ArielGlenn) p:5Triage>3Low [13:44:40] 6operations, 10Traffic, 5Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1824956 (10Nemo_bis) How to test accuracy? 
The original accuracy test for MaxMind was https://meta.wikimedia.org/wiki/MaxMindCityTesting [13:46:16] (03CR) 10Filippo Giunchedi: [C: 04-1] base::certificates: add puppet's CA to the trusted store (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [13:47:25] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [13:47:38] 7Puppet, 6operations, 5Patch-For-Review, 7Performance: Investigate mysterious_sysctl settings and figure out what to do with them - https://phabricator.wikimedia.org/T118812#1824958 (10ArielGlenn) https://tools.ietf.org/html/draft-stenberg-httpbis-tcp-00 covers a few of these and is supposedly current. [13:48:09] 7Puppet, 6operations, 5Patch-For-Review, 7Performance: Investigate mysterious_sysctl settings and figure out what to do with them - https://phabricator.wikimedia.org/T118812#1824959 (10ArielGlenn) p:5Triage>3Low [13:53:38] 6operations, 10Incident-20150825-Redis, 5Patch-For-Review: Enable memory overcommit for all redis hosts with persistance - https://phabricator.wikimedia.org/T91498#1824965 (10ArielGlenn) What else needs to happen here? I see the change was merged a couple of months back. [13:58:39] (03PS6) 10BBlack: varnish: get rid of some pre-systemd cruft [puppet] - 10https://gerrit.wikimedia.org/r/228591 [14:00:49] (03CR) 10BBlack: "Yeah the possible race is during initial system install, which could cause the sort of issue where after an initial puppet run fails, we h" [puppet] - 10https://gerrit.wikimedia.org/r/228591 (owner: 10BBlack) [14:02:17] (03CR) 10BBlack: [C: 032] varnish: get rid of some pre-systemd cruft [puppet] - 10https://gerrit.wikimedia.org/r/228591 (owner: 10BBlack) [14:14:26] 6operations, 10Traffic, 5Patch-For-Review: Switch Varnish's GeoIP code to libmaxminddb/GeoIP2 - https://phabricator.wikimedia.org/T99226#1825010 (10faidon) The vendor for both databases is the same, so the accuracy of the IPv4 data would be the same. The IPv6 switch means that we'd geolocate IPv6 directly no... [14:19:12] (03PS1) 10Faidon Liambotis: icinga: update check_legal_html with new HTML text [puppet] - 10https://gerrit.wikimedia.org/r/254854 [14:19:54] (03CR) 10Faidon Liambotis: [C: 032] icinga: update check_legal_html with new HTML text [puppet] - 10https://gerrit.wikimedia.org/r/254854 (owner: 10Faidon Liambotis) [14:20:40] (03PS1) 10Zfilipin: RuboCop: Fixed Style/BlockDelimiters offense [puppet] - 10https://gerrit.wikimedia.org/r/254855 (https://phabricator.wikimedia.org/T112651) [14:20:58] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/254855 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [14:22:19] Krenair: is my response to https://gerrit.wikimedia.org/r/#/c/245139/ sufficient? [14:22:34] Krenair: (or iow, should this be abandoned?) [14:25:01] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:26:51] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1825022 (10Ottomata) BTW, we are soon removing `ip` and `x_forwarded_for` from the webrequest table, in favor of the `X-Client-IP` header that will be set on all varnish requests... [14:27:26] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. 
- https://phabricator.wikimedia.org/T119372#1825023 (10Joe) From a full strace of pybal, I was able to see what happens when we get that warning in pybal: the socket closes as it receiv... [14:28:25] (03CR) 10Jgreen: [C: 031] Remove role::logging::udp2log::erbium and friends [puppet] - 10https://gerrit.wikimedia.org/r/252961 (https://phabricator.wikimedia.org/T84062) (owner: 10Faidon Liambotis) [14:29:55] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1825024 (10Milimetric) p:5Triage>3High [14:30:02] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet last ran 5 days ago [14:30:12] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1825027 (10Milimetric) a:3Milimetric [14:32:03] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:33:14] (03PS1) 10Faidon Liambotis: icinga: update check_legal_html whitespace fixup [puppet] - 10https://gerrit.wikimedia.org/r/254856 [14:33:37] (03PS2) 10Faidon Liambotis: icinga: check_legal_html whitespace fixup [puppet] - 10https://gerrit.wikimedia.org/r/254856 [14:33:48] heya godog: https://gerrit.wikimedia.org/r/#/c/254846/1 [14:34:19] paravoid: may I merge and sheppard this? [14:34:20] https://gerrit.wikimedia.org/r/#/c/252961/ [14:34:24] i have a bit of time atm [14:34:31] (03CR) 10Faidon Liambotis: [C: 032 V: 032] icinga: check_legal_html whitespace fixup [puppet] - 10https://gerrit.wikimedia.org/r/254856 (owner: 10Faidon Liambotis) [14:34:36] ottomata: please do! [14:34:47] :) [14:36:16] 6operations, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#1825044 (10BBlack) So, it sounds like apache sends a RST after about 64 seconds? That's probably not normal behavior for a normal connection... 
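For reference on the T119372 discussion just above: the idle-connection monitor keeps a TCP connection open to each realserver and, in the keepalive variant, relies on kernel TCP keepalive probes to detect dead peers. A minimal Python sketch of enabling per-socket keepalive with short timers follows; the timer values and helper below are illustrative assumptions, not pybal's actual code or settings.

```python
import socket

def open_idle_monitor_socket(host, port, idle=10, interval=5, probes=3):
    """Open a TCP connection and enable per-socket keepalive probes on it.

    With these settings the kernel sends a probe after `idle` seconds of
    silence, then every `interval` seconds, and declares the peer dead after
    `probes` unanswered probes -- roughly what an idle-connection monitor
    with TCP keepalive depends on. (Values are illustrative, not pybal's.)
    """
    s = socket.create_connection((host, port), timeout=5)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options below are Linux-specific.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s

# A server that closes or resets idle connections on its own timer (for
# example an httpd keepalive timeout) will still tear the connection down
# regardless of these probes, which would surface as the unexpected
# closes/RSTs described in T119372 above.
```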
[14:38:13] (03PS7) 10Ottomata: Remove role::logging::udp2log::erbium and friends [puppet] - 10https://gerrit.wikimedia.org/r/252961 (https://phabricator.wikimedia.org/T84062) (owner: 10Faidon Liambotis) [14:38:17] !log rebooting lvs3003 with new kernel [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:10] ottomata: ack, I'll take a look [14:39:31] danke [14:39:53] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [14:40:01] (03CR) 10Ottomata: [C: 032] Remove role::logging::udp2log::erbium and friends [puppet] - 10https://gerrit.wikimedia.org/r/252961 (https://phabricator.wikimedia.org/T84062) (owner: 10Faidon Liambotis) [14:48:21] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:19] (03PS1) 10Ottomata: Remove more erbium related udp2log stuff [puppet] - 10https://gerrit.wikimedia.org/r/254860 (https://phabricator.wikimedia.org/T84062) [14:50:22] (03CR) 10Ottomata: [C: 032] Remove more erbium related udp2log stuff [puppet] - 10https://gerrit.wikimedia.org/r/254860 (https://phabricator.wikimedia.org/T84062) (owner: 10Ottomata) [14:50:42] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:50:48] !log removing erbium udp2log varnishncsa instances from all caches [14:50:51] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:52] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:50:52] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:02] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:02] PROBLEM - IPsec on cp3013 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:12] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:12] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:12] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:13] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:21] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:21] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:21] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:22] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:23] bblack: is cp1053 you? 
[14:51:31] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:33] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:51:33] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:11] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:12] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:12] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:13] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:21] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:22] PROBLEM - IPsec on cp3012 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:22] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:22] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:34] I don't think so [14:52:42] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:42] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:42] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:42] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 14 connecting: cp1053_v4, cp1053_v6 [14:52:53] the ipsec spam sucks on one machine failing :/ [14:52:58] yeah :( [14:53:12] PROBLEM - HTTP 5xx reqs/min threshold on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [14:53:33] !log depooling cp1053 [14:53:36] nothing on the console [14:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:40] I'm going to powercycle [14:53:56] !log powercycling cp1053, unresponsive, blank console [14:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:59] it's depooled [14:55:43] PROBLEM - Varnish traffic logger - erbium on cp1054 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [14:55:44] PROBLEM - Varnish traffic logger - erbium on cp3046 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [14:55:56] ^ ottomata? [14:55:59] ok [14:56:06] I stopped pybal on lvs3001 for... 20 seconds [14:56:10] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM generally, please fix commit message and syntax errors reported by jenkins. 
also if the specific problem is a graphite one, the root " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [14:56:26] this is what happened on lvs3003 (running 4.2.0) when it got the traffic [14:56:32] RECOVERY - Host cp1053 is UP: PING WARNING - Packet loss = 93%, RTA = 3.98 ms [14:56:33] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 16 ESP OK [14:56:34] [ 893.799186] ------------[ cut here ]------------ [14:56:34] [ 893.804347] WARNING: CPU: 1 PID: 0 at /home/zumbi/linux-4.2.6/kernel/watchdo) [14:56:37] [ 893.816684] Watchdog detected hard LOCKUP on cpu 1 [14:56:40] [ 893.821838] Modules linked in: binfmt_misc ip_vs_sh ip_vs_wrr ip_vs nf_conntc [14:56:41] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 16 ESP OK [14:56:42] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 16 ESP OK [14:56:44] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - 16 ESP OK [14:56:44] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 16 ESP OK [14:56:44] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 16 ESP OK [14:56:52] lvs3003 login: [ 876.381380] BUG: unable to handle kernel paging request at fff [14:56:53] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 16 ESP OK [14:56:55] [ 876.389173] IP: [] dst_gc_task+0x35/0x210 [14:56:57] [ 876.395405] PGD 1d2d067 PUD 0 [14:57:00] [ 876.398830] Oops: 0000 [#1] SMP [14:57:00] great... [14:57:01] RECOVERY - IPsec on cp3013 is OK: Strongswan OK - 16 ESP OK [14:57:03] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - 16 ESP OK [14:57:03] RECOVERY - IPsec on cp3005 is OK: Strongswan OK - 16 ESP OK [14:57:06] <_joe_> paravoid: :/ [14:57:11] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 16 ESP OK [14:57:11] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 16 ESP OK [14:57:12] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - 16 ESP OK [14:57:12] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 16 ESP OK [14:57:12] http://ganglia.wikimedia.org/latest/?c=LVS%20loadbalancers%20esams&h=vl100-eth0.lvs3003.esams.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [14:57:12] PROBLEM - Varnish traffic logger - erbium on cp3045 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [14:57:12] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 16 ESP OK [14:57:22] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 16 ESP OK [14:57:23] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 16 ESP OK [14:57:23] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 16 ESP OK [14:57:31] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - 16 ESP OK [14:57:37] can't even SSH to it anymore [14:57:40] yay 4.2 :( [14:58:02] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 16 ESP OK [14:58:02] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 16 ESP OK [14:58:03] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 16 ESP OK [14:58:11] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 16 ESP OK [14:58:12] PROBLEM - Varnish traffic logger - erbium on cp1055 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [14:58:12] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 16 ESP OK [14:58:21] RECOVERY - IPsec on cp3012 is OK: Strongswan OK - 16 ESP OK [14:58:21] PROBLEM - Varnish traffic logger - erbium on cp3012 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [14:58:21] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 16 ESP OK [14:58:21] RECOVERY - IPsec on cp3009 is OK: Strongswan 
OK - 16 ESP OK [14:58:32] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 16 ESP OK [14:58:33] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 16 ESP OK [14:58:41] PROBLEM - Varnish traffic logger - erbium on cp3039 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [14:59:10] (03PS1) 10Ottomata: Remove gadolinium related udp2log and varnishncsa stuff [puppet] - 10https://gerrit.wikimedia.org/r/254864 (https://phabricator.wikimedia.org/T84062) [14:59:12] PROBLEM - Varnish traffic logger - erbium on cp3016 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [14:59:21] bblack hm, i just merged a change and had monitoring ensure => false. [14:59:31] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:38] i think probably it will need to run on all varnishes and then puppet run on icinga [14:59:39] neon* [14:59:42] ottomata: probably a race with neon then, will run puppet on master->neon [14:59:44] yeah [14:59:45] yeah [14:59:52] PROBLEM - Varnish traffic logger - erbium on cp4009 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:00:09] but, might as well wait for puppet to run on all varnishes i guess, this is probably going to happen with all of them :/ [15:00:10] [ 877.184863] Fixing recursive fault but reboot is needed! [15:00:11] fun fun [15:00:20] not sure of a better way to fix that...aside from manually editing configs on neon [15:00:42] anomie: ostriches: thcipriani: marktraceur: Krenair: Hi! I'm about to add a patch to today's mornign SWAT... It's just a small change to CentralNotice. Just to also note: I have to be AFK from now until about 10 minutes before the SWAT, but will be here during and after the deploy, hope that's OK! [15:00:43] PROBLEM - Varnish traffic logger - erbium on cp1053 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:00:43] PROBLEM - Varnish traffic logger - erbium on cp4016 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:00:52] PROBLEM - Varnish traffic logger - erbium on cp2002 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:12] PROBLEM - Varnish traffic logger - erbium on cp3007 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:18] (03CR) 10Ottomata: [C: 04-1] "Ah wait, we can't do this yet, the sqstat udp2log instance is still running on analytics1026. 
We need to fix other reqstats related thing" [puppet] - 10https://gerrit.wikimedia.org/r/254864 (https://phabricator.wikimedia.org/T84062) (owner: 10Ottomata) [15:01:21] PROBLEM - Varnish traffic logger - erbium on cp3008 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:23] PROBLEM - Varnish traffic logger - erbium on cp1068 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:23] PROBLEM - Varnish traffic logger - erbium on cp4010 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:32] (03PS2) 10coren: Labs: switch labstore NFS server to explicit LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254176 [15:01:34] (03CR) 10Ottomata: "See:" [puppet] - 10https://gerrit.wikimedia.org/r/254864 (https://phabricator.wikimedia.org/T84062) (owner: 10Ottomata) [15:01:42] PROBLEM - Varnish traffic logger - erbium on cp2001 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:42] PROBLEM - Varnish traffic logger - erbium on cp2013 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:51] PROBLEM - Varnish traffic logger - erbium on cp3017 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:01:51] PROBLEM - Varnish traffic logger - erbium on cp3048 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:02:29] apergos: Do you have a few secs to +1 https://gerrit.wikimedia.org/r/#/c/254176/ ? Very trivial. [15:02:38] lookin [15:02:44] heya bblack, paravoid: https://phabricator.wikimedia.org/T117727 [15:02:52] PROBLEM - Varnish traffic logger - erbium on cp1064 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:02:54] mainly i think the role::graphite::production::alerts need modified [15:03:02] PROBLEM - Varnish traffic logger - erbium on cp3003 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:03:03] before we can proceed with udp2log turn off [15:03:21] RECOVERY - HTTP 5xx reqs/min threshold on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:03:22] PROBLEM - Varnish traffic logger - erbium on cp3036 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:03:26] not sure if there are others [15:04:02] PROBLEM - Varnish traffic logger - erbium on cp4018 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:04:02] PROBLEM - Varnish traffic logger - erbium on cp1048 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:04:12] PROBLEM - Varnish traffic logger - erbium on cp4013 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:04:23] PROBLEM - Varnish traffic logger - erbium on cp3032 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:04:27] (03CR) 10ArielGlenn: [C: 031] Labs: switch labstore NFS server to explicit LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254176 (owner: 10coren) [15:04:52] apergos: ευχαριστώ :-) [15:05:16] anomie: ostriches: thcipriani: marktraceur: Krenair: Added to the deployments page... Didn't +2 the core patch yet, maybe I should? 
[15:05:30] K gotta run for a bit, back soon... I'll get backscrool tho :) thanks!!!! [15:05:35] (03CR) 10coren: [C: 032] Labs: switch labstore NFS server to explicit LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254176 (owner: 10coren) [15:05:53] PROBLEM - Varnish traffic logger - erbium on cp3015 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:06:17] apergos: here’s the patch for one of those access requests: https://gerrit.wikimedia.org/r/#/c/254409/ [15:06:22] PROBLEM - Varnish traffic logger - erbium on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:06:38] I haven’t done anything for ejegg since he only just got his request organized. [15:06:42] PROBLEM - Varnish traffic logger - erbium on cp2009 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:06:55] apergos: I think those are the only loose ends left from last week’s clinic [15:07:01] PROBLEM - Varnish traffic logger - erbium on cp4006 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:07:11] PROBLEM - Varnish traffic logger - erbium on cp2004 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:07:12] PROBLEM - Varnish traffic logger - erbium on cp1059 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:07:23] PROBLEM - Varnish traffic logger - erbium on cp4020 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:07:43] andrewbogott: thanks, need one of them to sign the doc yet [15:07:48] (already pinged) [15:07:52] PROBLEM - Varnish traffic logger - erbium on cp2024 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:08:01] it’s about to get loud in here [15:08:11] uh oh [15:08:13] PROBLEM - Varnish traffic logger - erbium on cp3049 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:08:22] PROBLEM - Varnish traffic logger - erbium on cp1067 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:08:23] PROBLEM - Varnish traffic logger - erbium on cp1052 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:08:42] PROBLEM - Varnish traffic logger - erbium on cp1047 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:08:50] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1825119 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/254409/ via @Andrew [15:09:21] PROBLEM - Varnish traffic logger - erbium on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:09:22] PROBLEM - Varnish traffic logger - erbium on cp3014 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:09:32] PROBLEM - Varnish traffic logger - erbium on cp4012 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:09:52] PROBLEM - Varnish traffic logger - erbium on cp2005 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) 
[15:10:33] PROBLEM - Varnish traffic logger - erbium on cp3037 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:11:14] !log Restarting NFS daemon on labstore1001 w/ the new LDAP shim. [15:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:21] PROBLEM - Varnish traffic logger - erbium on cp1061 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:11:32] PROBLEM - Varnish traffic logger - erbium on cp4008 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:11:41] PROBLEM - Varnish traffic logger - erbium on cp2008 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:12:12] PROBLEM - Varnish traffic logger - erbium on cp3038 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:14:03] PROBLEM - Varnish traffic logger - erbium on cp4011 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:03] PROBLEM - Varnish traffic logger - erbium on cp2020 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp3041 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp1046 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp1099 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp2019 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp3005 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp3004 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp3030 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp2023 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp1050 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp2017 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:04] PROBLEM - Varnish traffic logger - erbium on cp3043 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:14:06] PROBLEM - Varnish traffic logger - erbium on cp3018 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:07] PROBLEM - Varnish traffic logger - erbium on cp2007 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:07] PROBLEM - Varnish traffic logger - erbium on cp3047 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) 
[15:14:07] PROBLEM - Varnish traffic logger - erbium on cp1074 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:21] PROBLEM - Varnish traffic logger - erbium on cp4005 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:22] PROBLEM - Varnish traffic logger - erbium on cp4007 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:33] PROBLEM - Varnish traffic logger - erbium on cp3044 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:14:41] PROBLEM - Varnish traffic logger - erbium on cp1071 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:42] PROBLEM - Varnish traffic logger - erbium on cp1051 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:14:51] PROBLEM - Varnish traffic logger - erbium on cp3042 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:14:51] PROBLEM - Varnish traffic logger - erbium on cp3033 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:14:52] PROBLEM - Varnish traffic logger - erbium on cp1063 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:15:15] !log NFS server on labstore1001 restarted with no issues. [15:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:22] PROBLEM - Varnish traffic logger - erbium on cp2003 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:15:32] PROBLEM - Varnish traffic logger - erbium on cp3031 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:15:41] PROBLEM - Varnish traffic logger - erbium on cp1049 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:15:52] PROBLEM - Varnish traffic logger - erbium on cp3035 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:16:03] PROBLEM - Varnish traffic logger - erbium on cp1072 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:16:12] PROBLEM - Varnish traffic logger - erbium on cp2011 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:16:12] PROBLEM - Varnish traffic logger - erbium on cp2015 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:16:13] PROBLEM - Varnish traffic logger - erbium on cp3013 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:16:13] PROBLEM - Varnish traffic logger - erbium on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:16:13] PROBLEM - Varnish traffic logger - erbium on cp3040 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 114 (varnishlog) [15:16:22] PROBLEM - Varnish traffic logger - erbium on cp2021 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:16:31] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 87.94 ms [15:16:52] PROBLEM - Varnish traffic logger - erbium on cp3006 is CRITICAL: PROCS 
CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:16:52] PROBLEM - Varnish traffic logger - erbium on cp3009 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:17:12] PROBLEM - Varnish traffic logger - erbium on cp4014 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:20:23] PROBLEM - Varnish traffic logger - erbium on cp3010 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:21:02] PROBLEM - Varnish traffic logger - erbium on cp2014 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:21:11] PROBLEM - Varnish traffic logger - erbium on cp1066 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:21:51] PROBLEM - Varnish traffic logger - erbium on cp4015 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:21:52] PROBLEM - Varnish traffic logger - erbium on cp4019 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:21:52] PROBLEM - Varnish traffic logger - erbium on cp2016 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:22:42] PROBLEM - Varnish traffic logger - erbium on cp1073 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:23:36] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1825194 (10Ironholds) It would be really good if it could have been explicitly called out (with a ping) that the... [15:23:41] PROBLEM - Varnish traffic logger - erbium on cp2010 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:24:21] PROBLEM - Varnish traffic logger - erbium on cp1062 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:24:32] PROBLEM - Varnish traffic logger - erbium on cp4017 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:24:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[15:24:42] PROBLEM - Varnish traffic logger - erbium on cp1065 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:25:01] PROBLEM - Varnish traffic logger - erbium on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) [15:25:26] 6operations, 10Traffic: Convert misc cluster to 2-layer - https://phabricator.wikimedia.org/T119394#1825206 (10BBlack) 3NEW [15:25:58] 6operations, 10Traffic: Convert misc cluster to 2-layer - https://phabricator.wikimedia.org/T119394#1825206 (10BBlack) [15:25:59] 6operations, 10Traffic, 5Patch-For-Review: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1825212 (10BBlack) [15:26:43] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Backport etcd 2.2 to jessie - https://phabricator.wikimedia.org/T118830#1825214 (10Joe) a:3Joe [15:27:51] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:19] (03PS1) 10Filippo Giunchedi: diamond: use upstream StatsdHandler [puppet] - 10https://gerrit.wikimedia.org/r/254872 (https://phabricator.wikimedia.org/T116033) [15:30:56] 6operations, 10Traffic: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825233 (10BBlack) 3NEW [15:31:11] 6operations, 10Traffic: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#1825241 (10BBlack) [15:31:12] 6operations, 10Traffic: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825242 (10BBlack) [15:31:31] 6operations, 10Traffic: Convert misc cluster to 2-layer - https://phabricator.wikimedia.org/T119394#1825244 (10BBlack) [15:31:32] 6operations, 10Traffic: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825233 (10BBlack) [15:33:28] Coren: if you are no longer mucking with NFS, could I get you to do a few migrations of tools instances? [15:34:18] andrewbogott: I'm still in it for a little bit; the actual restart is done though so it's no longer as time sensitive. What do you need migrated? [15:34:47] Coren: labvirt1002 is getting full, should move two or three things to labvirt1010 [15:35:09] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [15:35:09] not pressing but should happen today. I’m happy to do it, just trying to make sure everyone knows how/has done it :) [15:35:34] andrewbogott: If you make a ticket and assign it to me I'll be able to do it in ~1h if that's okay with you? [15:35:43] sure! thanks [15:36:02] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825257 (10jcrespo) So, the recovery thread was still running, efectively setting the hosts in read-only mode. labsdb1002 has finally started its replication proc... [15:36:14] 6operations, 10Traffic: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825258 (10BBlack) Further notes: The changes to the instance name and admin_port should be machine-local in effect - they're not going to cause actual traffic havoc, but... [15:37:01] (03CR) 10Rush: [C: 031] "No objection from me, fwiw in regards to T116033 we could ship metrics in batches either way. 
The difference is dastatsd will ship in batch" [puppet] - 10https://gerrit.wikimedia.org/r/254872 (https://phabricator.wikimedia.org/T116033) (owner: 10Filippo Giunchedi) [15:38:12] (03PS1) 10Filippo Giunchedi: diamond: send statsd metrics in batches [puppet] - 10https://gerrit.wikimedia.org/r/254873 (https://phabricator.wikimedia.org/T116033) [15:39:30] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1825288 (10Ottomata) Aye ok. I should have been more explicit about this. (You were CCed on this ticket though... [15:41:34] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1041 has hardware issues - https://phabricator.wikimedia.org/T119199#1825293 (10Cmjohnson) [15:42:57] AndyRussG|a-whey: everything looks good for SWAT, normally, for submodules whose branch name tracks core's version (e.g. a branch is cut during the train named wmf/1.27.0-wmf.7) it is unnecessary to create a submodule bump (gerrit will bump the submodule on that particular branch of mediawiki/core); however, in this case (since you use the long-lived wmf-deploy branch for extensions/CentralNotice) [15:42:59] what you have set up seems fine. Thanks for checking pre-SWAT :) [15:51:50] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 88.08 ms [15:53:25] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1825326 (10Andrew) 3NEW a:3Andrew [15:53:34] <_joe_> !log live-resizing /var/lib/wdqs on wdqs1002 [15:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:09] RECOVERY - Disk space on wdqs1002 is OK: DISK OK [15:56:22] !log rebooted lvs3003 back with (an updated) 3.19 [15:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:50] thcipriani: hi! thanks! :) Yeah CentralNotice is still on this old system [15:59:39] (03CR) 10Odder: "This is now ready to be merged as the domain is now using the Foundation's name servers for DNS." [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151123T1600). Please do the needful. [16:00:05] MatmaRex kart_ jhobs AndyRussG Glaisher: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:14] I'm here [16:00:16] hi. [16:00:37] i can go later, in a meeting atm [16:00:44] Hi [16:00:45] thcipriani: I'll +2 the core patch? [16:01:10] (03CR) 10Mforns: [C: 04-1] "LGTM, but there's this missing comma. cheers" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [16:01:25] I can SWAT today. AndyRussG the submodule bump? No, I'll +2 that as part of SWAT. [16:01:42] thcipriani: OK got it :) thanks! [16:01:59] LMK if u need anything or see any issues ;) [16:02:26] It's not necessary to scap for us, BTW. No new i18n messages [16:02:30] here too. [16:03:05] MatmaRex: do you have a patch for core's .7 branch that backports 254373? [16:03:37] thcipriani: will submit in a minute [16:03:44] (03CR) 10Odder: "This can now be merged as the domain is now using the Foundation's servers for DNS."
[puppet] - 10https://gerrit.wikimedia.org/r/254305 (owner: 10Odder) [16:03:48] MatmaRex: kk, thanks. [16:04:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254117 (https://phabricator.wikimedia.org/T118033) (owner: 10KartikMistry) [16:05:02] (03Merged) 10jenkins-bot: CX: Enable article-recommender-1 campaign as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254117 (https://phabricator.wikimedia.org/T118033) (owner: 10KartikMistry) [16:05:41] that's weird, fatalmonitor on fluorine is just showing 3 lines. [16:07:44] my backport is https://gerrit.wikimedia.org/r/#/c/254880/ , updated the Deployments page too. [16:07:44] (03PS2) 10Odder: Redirect wikiquote.pl to pl.wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/254305 [16:08:25] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1825365 (10Jdforrester-WMF) 3NEW [16:10:25] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable article-recommender-1 campaign as default [[gerrit:254117]] (duration: 00m 52s) [16:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:33] ^ kart_ check please [16:11:40] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [16:12:15] lots of abusefilter.php spam in fatalmonitor [16:12:24] (pre-deployment as well as post) [16:12:58] (03PS1) 10coren: Labs: Have fileservers no longer nsswitch to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254881 (https://phabricator.wikimedia.org/T87870) [16:13:52] (03CR) 10jenkins-bot: [V: 04-1] Labs: Have fileservers no longer nsswitch to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254881 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [16:15:03] thcipriani: thanks! [16:15:13] kart_: everything look good? [16:15:47] (still here, had a connection issue) [16:17:00] MatmaRex: kk, you're up next. [16:17:17] thcipriani: leila will check. Nothing for me here :) [16:17:20] PROBLEM - Ensure legal html en.wb on en.wikibooks.org is CRITICAL: a href=//wikimediafoundation.org/wiki/Privacy_policyPrivacy policy/a html not found [16:17:22] so we're good. [16:17:26] kart_: kk [16:20:52] (03PS2) 10Joal: monitoring/graphite: Add until time limit argument [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) [16:21:39] (03PS2) 10coren: Labs: Have fileservers no longer nsswitch to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254881 (https://phabricator.wikimedia.org/T87870) [16:22:03] weird.. I can see the privacy policy link on enwb [16:22:09] (03CR) 10Joal: monitoring/graphite: Add until time limit argument (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [16:24:35] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1825404 (10Krenair) statistics-users is not necessary, researchers is the right group. [16:26:11] apergos: https://gerrit.wikimedia.org/r/#/c/254881/ when you have a few. [16:26:47] looking [16:27:55] is it still nor merged? :/ [16:28:09] i think we need to seriously reevaluate our ability to deploy 8 patches in an hour… [16:28:56] or stop running zend test for wmf branches? have we stopped using php 5.3 in production yet? 
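For reference, the submodule-bump mechanics thcipriani describes at 15:42 above amount to updating the submodule pointer recorded in mediawiki/core on the deployed branch. A rough sketch of doing that bump by hand follows; the branch names are taken from the discussion, while the paths, remote name and commit message are illustrative only.

```
# Rough sketch of a manual submodule bump of the kind described above;
# branch names come from the discussion, everything else is illustrative.
cd core                                   # a checkout of mediawiki/core
git checkout wmf/1.27.0-wmf.7
git submodule update --init extensions/CentralNotice
cd extensions/CentralNotice
git fetch origin
git checkout origin/wmf-deploy            # the long-lived deploy branch mentioned above
cd ../..
git add extensions/CentralNotice          # stages the new submodule pointer
git commit -m "Bump CentralNotice submodule"
```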
[16:28:57] MatmaRex: yeah, I should have done config patches first, which is my fault. But yes, until we move away from using the zend test for production code, I think you are likely correct. [16:30:26] (03CR) 10Mforns: monitoring/graphite: Add until time limit argument (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [16:30:37] I wouldn't expect mine to cause any issues or take much time; it's simply removing a config we added a couple weeks ago. We'll probably be ok [16:30:42] (03CR) 10Mforns: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/254846 (https://phabricator.wikimedia.org/T116035) (owner: 10Joal) [16:32:49] !log thcipriani@tin Synchronized php-1.27.0-wmf.7/includes/specials/SpecialWatchlist.php: SWAT: Special:Watchlist: Add user preference to "Show last" options, fix float comparison [[gerrit:254880]] (duration: 00m 29s) [16:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:55] ^ MatmaRex check please [16:33:20] thcipriani: all is fine, thanks! [16:33:27] MatmaRex: thank you. [16:33:33] OK, continuing with config changes. [16:33:45] Coren: that file in base-files: what puts it there and do we guarantee it being there? [16:34:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254831 (https://phabricator.wikimedia.org/T118881) (owner: 10Jhobs) [16:34:38] (03Merged) 10jenkins-bot: End first mobile QuickSurveys campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254831 (https://phabricator.wikimedia.org/T118881) (owner: 10Jhobs) [16:35:01] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1825437 (10bd808) I can assert that we are currently seeing uid preservation when rsycning from tin to mira. Checking the ownership of `/srv/mediawiki-staging/php-1.27.0-wmf.7/cache/... [16:36:30] jhobs: guess I should have asked this before I merged. Having wmgUseQuickSurveys set to true for enwiki without any wgQuickSurveyConfig default isn't going to cause any havoc, right? [16:36:42] thcipriani: nope [16:36:44] kk [16:38:24] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: End first mobile QuickSurveys campaign [[gerrit:254831]] (duration: 00m 28s) [16:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:29] ^ jhobs check please [16:38:37] apergos: https://gerrit.wikimedia.org/r/#/c/254881/ [16:38:47] apergos: https://packages.debian.org/search?keywords=base-files [16:38:51] thcipriani: looks good, thanks! [16:38:54] jhobs: thanks! [16:38:55] Sorry, first was a mispaste. :-) [16:39:40] jhobs, thcipriani: looks good in the browser -- just want to look at the warning log to see if the value of `true` really isn't causing havoc [16:39:42] apergos: tl;dr: that's part of the base-files package and is part of the minimal system. [16:39:52] right [16:40:04] AndyRussG: I'm going to come back to yours just because it requires the core test suite to run (zend). Glaisher I'm going to go ahead and get yours out the door. [16:40:24] alright [16:40:25] thcipriani: K np :) [16:40:30] phuedx: oh good call, didn't think of that. 
It shouldn't be though, that just loads the extension with 0 surveys enabled [16:40:38] (03CR) 10ArielGlenn: [C: 031] Labs: Have fileservers no longer nsswitch to LDAP [puppet] - 10https://gerrit.wikimedia.org/r/254881 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [16:40:42] look ok to me then [16:40:45] *looks [16:41:04] apergos: Thankee. [16:41:17] phuedx: good call, watching logs, nothing out of the ordinary apart from the abusefilter notice spam I mentioned earlier. [16:41:25] (03CR) 10coren: [C: 032] "Long, long overdue but there at last." [puppet] - 10https://gerrit.wikimedia.org/r/254881 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [16:41:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52) [16:41:46] 6operations, 10Gitblit, 7Upstream: Accessing raw link on git.wikimedia.org causes "Error Sorry, the repository mediawiki does not have a extensions branch!" - https://phabricator.wikimedia.org/T118156#1825479 (10Paladox) Filed task here T119409 [16:42:25] (03Merged) 10jenkins-bot: noindex user namespace on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52) [16:42:40] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1825482 (10Joe) a:3Joe [16:43:33] !log restarting mysqld on labsdb1003 again [16:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:20] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: noindex user namespace on en.wikipedia.org [[gerrit:237330]] (duration: 00m 28s) [16:44:22] ^ Glaisher check please [16:44:45] looking [16:45:10] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [16:45:46] Oh, bleh. [16:46:18] thcipriani: meta robots tag is now present and action=info has also changed. :) [16:46:25] thanks a lot [16:46:38] Glaisher: awesome! thanks for checking. [16:48:02] I'm guessing hiera() returns a string. ^@#% [16:48:42] thcipriani: thanks -- fatalmonitor looks clean to me [16:49:27] AndyRussG: now we wait for zuul. These changes require a full scap, seemingly? (l10nupdate stuff) [16:49:55] thcipriani: the full scap is not really needed actually [16:50:26] thcipriani: I mean, if you want to do it, go ahead. I just merged master into our deploy branch and that picked up some stuff from translatewiki [16:50:40] But the stuff that's urgent for today is just in PHP and JS code [16:51:04] (03PS1) 10coren: Fix nsswitch_use_default to a string [puppet] - 10https://gerrit.wikimedia.org/r/254885 (https://phabricator.wikimedia.org/T87870) [16:51:15] apergos: ^^ hiera() doesn't like booleans [16:51:50] ah boooo [16:52:12] AndyRussG: if you'll be around for a minute, I'd rather do the full scap. Doesn't look like there are any other deployments scheduled after this today (except evening SWAT). [16:52:20] anyone knows why on enwiki User:MediaWiki message delivery posted the message twice to some people (like me)? [16:52:21] thcipriani: sure! [16:52:29] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1825526 (10leila) @Nuria and @ori, agreed. 
Please keep me in the loop if a discussion happens outside of this thread and T118557. We have an upcoming research in Q3 that relies h... [16:52:52] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825528 (10jcrespo) Corruption is still on, I am tring to repair the tables manually. For that, I need to put the labsdb servers down for some time. This may take... [16:53:31] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1825529 (10ori) >>! In T119144#1825022, @Ottomata wrote: > BTW, we are [[ https://phabricator.wikimedia.org/T118557 | soon removing ]] `ip` and `x_forwarded_for` from the webrequ... [16:53:35] (03CR) 10ArielGlenn: [C: 031] Fix nsswitch_use_default to a string [puppet] - 10https://gerrit.wikimedia.org/r/254885 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [16:53:37] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1825531 (10leila) [16:53:47] hope I remember that for the next time [16:55:15] (03CR) 10coren: [C: 032] "Small fix." [puppet] - 10https://gerrit.wikimedia.org/r/254885 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [16:57:20] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:58:26] apergos: Baaah. Still doesn't work. *sigh* after the meeting. I hate hiera(). [16:58:58] 6operations, 10Dumps-Generation, 7HHVM, 5Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1825557 (10ArielGlenn) [17:00:14] k [17:02:06] 6operations: Audit Trusty hosts in search of 3.13 kernel bug - https://phabricator.wikimedia.org/T119411#1825606 (10Andrew) 3NEW [17:02:17] !log thcipriani@tin Started scap: SWAT: Update CentralNotice [[gerrit:254865]] [17:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:46] PROBLEM - mysqld processes on labsdb1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:02:52] ^ AndyRussG started SWAT for CentralNotice, it'll rebuild the l10ncache before sync starts (which takes some time) I'll keep you posted. [17:03:57] 6operations: Audit Trusty hosts in search of 3.13 kernel bug - https://phabricator.wikimedia.org/T119411#1825615 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [17:04:57] thcipriani: K thx! [17:06:46] RECOVERY - mysqld processes on labsdb1003 is OK: PROCS OK: 1 process with command name mysqld [17:10:20] AndyRussG: started syncing, it'll still have to rebuild cdbs, but that _should be_ quick :) [17:10:32] cool! [17:16:26] thcipriani: woops, had to restart my computer, all good now ;) [17:16:39] AndyRussG: no problem, still sync-ing. [17:17:49] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [17:19:41] RECOVERY - HTTP 5xx reqs/min anomaly on graphite1001 is OK: OK: No anomaly detected [17:25:26] thcipriani: what's the status? 
aaaarg I just got a call from my kid's school that she's sick and I should pick her up :( [17:25:40] AndyRussG: just about done rebuilding CDBs [17:26:00] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:02] should be done relatively shortly at this point [17:26:10] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: puppet fail [17:26:28] K thanks! [17:27:00] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:04] thcipriani: lemme get someone else from FR to cover for me during the 15 minutes it'll take me to go and get her [17:28:58] AndyRussG: kk, as long as I've got someone to test the new functionality, I'm good. [17:28:58] PROBLEM - RAID on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:58] PROBLEM - Check size of conntrack table on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:59] PROBLEM - puppet last run on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:59] PROBLEM - DPKG on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:28:59] !log thcipriani@tin Finished scap: SWAT: Update CentralNotice [[gerrit:254865]] (duration: 26m 31s) [17:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:59] ^ AndyRussG finished! [17:29:25] thcipriani: K... it'll take about 10 minutes for the cache to turn over I think [17:29:39] RECOVERY - Check size of conntrack table on mw1122 is OK: OK: nf_conntrack is 0 % full [17:29:39] RECOVERY - RAID on mw1122 is OK: OK: no RAID installed [17:29:48] i'll restart mw1122 [17:30:20] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 14 minutes ago with 0 failures [17:30:25] !log restarting hhvm on mw1122 after lock-up; backtrace saved as /tmp/hhvm.32016.bt. [17:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:31:40] RECOVERY - DPKG on mw1122 is OK: All packages OK [17:31:50] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [17:32:20] !log emergency hotfix on labstore1001 (apt-get remove nfsd-ldap) as it appears to not work in all situations. [17:32:21] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1825736 (10Nuria) @leila, @ottomata: note that even then not all eventlogging requests go through varnish, only the ones that come from the javascript client, there are 4 clients... [17:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:32:48] thcipriani: ejegg: cwd: K I'll be back in about 10 minutes. The functionality to test is just the use of cookies on FR campaigns for the impressions diet, though the main thing is just that nothing wonky is caused anywhere else (very unlikely). Thanks so much for covering for me!!!! [17:32:49] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 70672 bytes in 0.480 second response time [17:33:26] thcipriani: ^ ejegg and cwd should be able to help if there are issues! thx! [17:33:35] AndyRussG: thanks! [17:34:29] fwiw, all logs look fine, just need to make sure that everything works as expected :) [17:35:38] sup thcipriani [17:35:44] cwd: o/ [17:35:50] \o [17:35:56] wanna get lunch later? [17:37:29] wrong channel!
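The 17:30 !log above notes that a backtrace was saved to /tmp/hhvm.32016.bt before HHVM was restarted on mw1122, but the log does not show how that trace was taken. One generic way to capture a backtrace of a wedged process before restarting it is gdb in batch mode; the sketch below assumes gdb and pidof are available and is not necessarily the tooling used here.

```
# One generic way to grab a backtrace of a wedged process before restarting
# it (sketch only; not necessarily how the mw1122 trace was produced).
pid=$(pidof -s hhvm)
gdb -p "$pid" -batch -ex 'thread apply all bt' > "/tmp/hhvm.${pid}.bt" 2>&1
service hhvm restart
```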
[17:39:09] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [17:39:27] thcipriani: you're in sf? [17:39:46] ori: no, cwd and I are both in Longmont :) [17:39:48] 5xx anomaly does not represent a real issue [17:40:10] peaks are in the 200-300 range, not insane given that i just restarted an app server [17:40:27] YuviPanda: hotfixed a rollback, but why it doesn't work is confusing the hell out of me - investigating now but function is back to happy fun. [17:41:31] It works perfectly on labstore1002. [17:41:39] * Coren grumbles. [17:42:28] Oooooh. I think I know wth is going on. [17:42:40] * Coren stares suspiciously at nscd. [17:46:09] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1825834 (10mmodell) @bd808: So it sounds like having consistent UIDs is the better / easier solution, based on the complexity of getting name mapping to work? [17:46:50] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1825840 (10Joe) I think quotes 718687410 and 718687406 are the most interesting ones. I would go with the least expensive one: it will still be a big improvement on what we have now a... [17:47:39] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: Puppet has 1 failures [17:49:49] ejegg cwd looks like banners are showing fine. https://en.wikipedia.org/wiki/Main_Page?uselang=es&country=CA&randomcampaign=0.1&recordImpressionSampleRate=1&force=true [17:50:58] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825876 (10jcrespo) I think I may have fixed the corruption on labsdb1003, but sadly some data was lost in the process. I cannot guarantee the accuracy of its con... [17:51:04] thanks AndyRussG|mob, that's a relief [17:53:11] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:54:13] ACKNOWLEDGEMENT - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed Coren consequence of the labsdb outage [17:54:40] !log temporarily stopping labsdb1002 to solve replication issues [17:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:34] thcipriani: cwd: ejegg: back....! [17:56:16] 6operations: Kernel errors on rendering hosts - https://phabricator.wikimedia.org/T118888#1825911 (10MoritzMuehlenhoff) Auditing over servers will be handled through https://phabricator.wikimedia.org/T119411 [17:56:30] AndyRussG: that was quick :) logs look fine to me, haven't heard about any other problems [17:56:58] thcipriani: cool! yeah just checking the new functionality [17:59:50] 6operations, 7Database, 5Patch-For-Review: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1825921 (10jcrespo) 5Resolved>3Open This wasn't done, labs is not fixed yet. 
[18:03:52] thcipriani: new functionality almost completely confirmed OK, just one small detail more to check [18:04:02] 6operations, 6Labs: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#1825939 (10yuvipanda) p:5Triage>3Normal [18:04:51] 7Puppet, 6Labs, 10Tool-Labs: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733#1825945 (10coren) a:5coren>3None [18:13:22] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825994 (10Luke081515) [18:14:51] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:15:19] (03PS1) 10Dzahn: magnesium: include standard [puppet] - 10https://gerrit.wikimedia.org/r/254891 [18:16:00] RECOVERY - HTTP 5xx reqs/min anomaly on graphite1001 is OK: OK: No anomaly detected [18:16:04] (03PS2) 10Dzahn: magnesium: include standard [puppet] - 10https://gerrit.wikimedia.org/r/254891 [18:17:09] (03CR) 10Dzahn: [C: 032] magnesium: include standard [puppet] - 10https://gerrit.wikimedia.org/r/254891 (owner: 10Dzahn) [18:23:40] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1826037 (10Andrew) a:5Andrew>3yuvipanda [18:24:36] thcipriani: K yes all set with everything, thanks again! [18:25:15] AndyRussG: awesome! Thanks for checking everything :) [18:25:22] likewise thanks! :) [18:33:47] (03PS2) 10Dzahn: magnesium/RT: standard::has_default_mail_relay: false [puppet] - 10https://gerrit.wikimedia.org/r/254895 [18:33:49] (03CR) 10Dzahn: [C: 032] magnesium/RT: standard::has_default_mail_relay: false [puppet] - 10https://gerrit.wikimedia.org/r/254895 (owner: 10Dzahn) [18:35:02] (03CR) 10Dzahn: "probably this was the case because RT and RT got mixed up and this was supposed to be just in requesttracker all the time. racktables does" [puppet] - 10https://gerrit.wikimedia.org/r/254895 (owner: 10Dzahn) [18:36:03] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826080 (10jcrespo) Short term fix done on labsdb1002, too: ``` MariaDB LABS localhost (none) > pager grep Seconds PAGER set to 'grep Seconds' MariaDB LABS local... [18:39:37] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826094 (10Multichill) @jcrespo: Do you update dns entries before you kill a database server? We use names like "wikidatawiki.labsdb" and that way the impact is r... [18:41:14] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826097 (10jcrespo) @Multichill No, that was done on purpose. The servers were killed because of an OOM, as we have 2/3 servers down, redirecting to a single server... [18:43:09] So, YuviPanda, I will create a longer report once the whole thing gets fixed [18:43:19] but now we should be on the right track [18:43:22] \o/ [18:43:28] are things recovered or still recovering?
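jcrespo's paste at 18:36 above uses `pager grep Seconds` inside the MariaDB client to watch the slave catch up after the restart. The same check can be run from a shell roughly as follows; the host name is taken from the discussion, everything else is illustrative.

```
# Roughly the same replication-lag check from a shell prompt (host name
# from the discussion above, otherwise illustrative). Seconds_Behind_Master
# should trend towards 0 as the slave catches up.
mysql -h labsdb1002 -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master
```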
[18:43:30] * YuviPanda reads ticket [18:43:40] you do not need to read it all [18:43:54] let me update you, and I can paste this there afterwards [18:44:08] awesome [18:44:12] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:44:23] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826103 (10Betacommand) I know this is more work, but I think the community would prefer that you do a fresh import of the data. Having an unknown volume of data... [18:44:29] oh, we are not in labs [18:44:34] wrong channel [18:46:54] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826118 (10jcrespo) @Betacommand I hear you, but doing a full import will take down the servers for 5 days. Having only 1 server available and up to date is not a... [18:55:04] (03CR) 10Dzahn: "@Odder in a https-only context we would only want to link that domain to actually redirect if we'd also spend the money to have an SSL cer" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [18:59:27] (03CR) 10Dzahn: [C: 04-1] "@Odder we'd only want to merge it if we actually had an SSL cert for it / if wikiquote.pl was added on a cert" [puppet] - 10https://gerrit.wikimedia.org/r/254305 (owner: 10Odder) [19:00:02] (03CR) 10Jdrewniak: [C: 031] Add jgirault and jdrewniak to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/254409 (owner: 10Andrew Bogott) [19:11:41] mutante: ohaio [19:11:45] (03PS2) 10Dzahn: Add addshore.com to planet.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/254653 (owner: 10Addshore) [19:12:07] (03CR) 10Dzahn: [C: 032] Add addshore.com to planet.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/254653 (owner: 10Addshore) [19:18:44] (03PS2) 10Dzahn: [Planet Wikimedia] Add MisterSanderson to Portuguese planet [puppet] - 10https://gerrit.wikimedia.org/r/254676 (owner: 10Nemo bis) [19:20:30] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826180 (10jcrespo) So, to give you a general update: 1. After an upgrade and configuration change to workaround a bug in the storage engine/paralel replication,... [19:32:20] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826209 (10jcrespo) p:5Unbreak!>3High [19:38:38] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Add MisterSanderson to Portuguese planet [puppet] - 10https://gerrit.wikimedia.org/r/254676 (owner: 10Nemo bis) [19:43:23] (03PS1) 10Jhobs: [WIP] Third QuickSurveys external survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254908 (https://phabricator.wikimedia.org/T116433) [19:47:28] (03CR) 10Jhobs: [C: 04-1] "Placeholder to schedule SWAT deploy with proper gerrit id." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/254908 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [19:48:51] (03CR) 10Nikerabbit: CX: Enable article-recommender-1 campaign as default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254117 (https://phabricator.wikimedia.org/T118033) (owner: 10KartikMistry) [20:23:24] (03PS1) 10Reedy: Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 [20:30:49] 6operations, 10hardware-requests: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1035333 (10RobH) [20:36:41] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826434 (10jcrespo) p:5High>3Unbreak! Labsdb1002 crashed again, creating new corruption on some user tables: ``` 151123 20:02:14 [ERROR] mysqld: Table './p50... [20:37:22] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [100000000.0] [20:42:53] (03PS1) 10Rush: git-ssh.wm.o IPv6 reservation [dns] - 10https://gerrit.wikimedia.org/r/254928 [20:45:21] (03PS1) 10Jcrespo: Commenting options that are causing replication issues [puppet] - 10https://gerrit.wikimedia.org/r/254929 [20:46:07] (03CR) 10Jcrespo: [C: 032] "Need to apply this ASAP." [puppet] - 10https://gerrit.wikimedia.org/r/254929 (owner: 10Jcrespo) [20:47:37] !log restarting labsdb1002, hopfully for the last time [20:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:05] _joe_: did you merge your ssl patches yet? [20:50:12] MaxSem: reports self hosted puppetmaster is broken [20:50:21] * YuviPanda verifies [20:58:35] (03PS1) 10Jcrespo: Make replication start automatically on labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/254936 [20:59:53] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [21:03:29] (03CR) 10Jcrespo: [C: 032] Make replication start automatically on labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/254936 (owner: 10Jcrespo) [21:03:40] (03PS1) 10coren: Fix the script not passing arguments to the binary [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/254941 (https://phabricator.wikimedia.org/T87870) [21:04:05] andrewbogott: ^^ [21:05:09] Coren: won’t quoting “$@“ squish all the args into one big unparseable super-arg? [21:05:29] No, $@ is magical for that specific purpose. You're thinking "$*" [21:06:35] hm… then what’s the difference between $@ and “$@" [21:06:49] I must not be counting the number of passes correctly [21:07:36] http://stackoverflow.com/questions/3307672/whats-the-difference-between-and-in-unix [21:07:36] Look at the second answer which is the correct one. :-) [21:08:20] 6operations, 6Labs, 10Tool-Labs, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826523 (10jcrespo) After fixing this issue (and setting so that if databases crash, they can recover automatically), I will enforce stricter per-user query limit... [21:08:21] oh! that’s… handy but unexpected [21:08:46] (03CR) 10Andrew Bogott: [C: 031] "this seems better :)" [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/254941 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [21:08:52] It's very expected if you've been messing with shell expansion since the original bourne. 
:-) [21:09:05] I don't even think about it by now, it's just reflex. [21:09:37] (03CR) 10coren: [C: 032 V: 032] "This makes it actually work in non-trivial cases" [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/254941 (https://phabricator.wikimedia.org/T87870) (owner: 10coren) [21:10:31] * Coren builds anew [21:12:02] (03PS1) 10Andrew Bogott: Designate: No longer require paramiko [puppet] - 10https://gerrit.wikimedia.org/r/254944 (https://phabricator.wikimedia.org/T119408) [21:13:48] (03CR) 10Andrew Bogott: [C: 032] Designate: No longer require paramiko [puppet] - 10https://gerrit.wikimedia.org/r/254944 (https://phabricator.wikimedia.org/T119408) (owner: 10Andrew Bogott) [21:15:53] !log enforcing limits on running queries on all labsdb databases [21:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:01] !log Updated Wikidata's property suggester with data from today's json dump [21:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:23] sjoerddebruin: ^ [21:23:54] 6operations, 7Mail: EXIM Config: Remove yana alias for ywelinder - https://phabricator.wikimedia.org/T118899#1826583 (10Dzahn) a:3Dzahn [21:24:22] 6operations, 7Mail: EXIM Config: Remove yana alias for ywelinder - https://phabricator.wikimedia.org/T118899#1812033 (10Dzahn) done. the alias has been removed. max a couple minutes until puppet run on the mailservers [21:24:29] 6operations, 7Mail: EXIM Config: Remove yana alias for ywelinder - https://phabricator.wikimedia.org/T118899#1826585 (10Dzahn) 5Open>3Resolved [21:36:40] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1826605 (10JanZerebecki) > With our current setup, it takes 1 year to to a rolling restart of all our database boxes, and probably around 1000 man-hours :-O I didn't expect that. Is it c... [21:41:03] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [21:45:49] (03PS6) 10Dzahn: sudo journalctl: make missing restrictions obvious [puppet] - 10https://gerrit.wikimedia.org/r/251714 (https://phabricator.wikimedia.org/T115067) (owner: 10JanZerebecki) [21:56:20] (03PS5) 10Rush: admin: allow all active users to be applied [puppet] - 10https://gerrit.wikimedia.org/r/244471 (https://phabricator.wikimedia.org/T114161) [22:04:15] (03CR) 10ArielGlenn: "Hey Jdrewniak, can you please sign that L3 form? It's linked on the phab task for access for you guys.." [puppet] - 10https://gerrit.wikimedia.org/r/254409 (owner: 10Andrew Bogott) [22:04:32] hoo|away: <3 [22:09:59] (03CR) 10Alex Monk: [C: 04-1] "No linked phabricator task" [puppet] - 10https://gerrit.wikimedia.org/r/254409 (owner: 10Andrew Bogott) [22:10:05] (03PS1) 10coren: Put nsswitch_use_default hiera variable somewhere it will be used [puppet] - 10https://gerrit.wikimedia.org/r/255027 [22:10:14] mutante: ^^ this puts it into the role [22:10:31] 6operations, 10Wikimedia-SVG-rendering, 7Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1826711 (10matmarex) As noted above, this issue has been fixed in the rsvg library we use to convert SVG images to PNG. T... 
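For reference, the "$@" versus "$*" question from the 21:05-21:09 exchange above (prompted by the nfsd-ldap wrapper fix in change 254941) comes down to how each expansion re-splits the arguments. A minimal sketch, with made-up function and argument names, follows; it is not the actual wrapper script.

```
#!/bin/bash
# Minimal sketch of the behaviour discussed above (made-up names, not the
# actual nfsd-ldap wrapper): "$@" re-expands each argument as its own word,
# "$*" joins everything into a single word, and unquoted $@ re-splits on
# whitespace.
count() { echo "$# argument(s)"; }

demo() {
    count "$@"   # 2 argument(s): each original argument kept intact
    count "$*"   # 1 argument(s): all arguments collapsed into one word
    count $@     # 3 argument(s): "first arg" gets re-split on the space
}

demo "first arg" second
```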
[22:12:10] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1826713 (10Jdrewniak) @ArielGlenn @Andrew yes, I signed the L3 earlier today. [22:12:30] Coren: cool, that's the nicer way anyways. it will work because the role keyword is already used in site.pp for that, yea [22:12:58] (03CR) 10Dzahn: [C: 031] Put nsswitch_use_default hiera variable somewhere it will be used [puppet] - 10https://gerrit.wikimedia.org/r/255027 (owner: 10coren) [22:13:20] (03CR) 10coren: [C: 032] Put nsswitch_use_default hiera variable somewhere it will be used [puppet] - 10https://gerrit.wikimedia.org/r/255027 (owner: 10coren) [22:15:12] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 2 failures [22:17:16] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1826734 (10Platonides) Is there any technical reason not to have the servers using 700 certificates, using SNI for most of them? I propose WMF to get certi... [22:20:12] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 2 failures [22:22:19] (03CR) 10Dzahn: "as it is right now it seems you are just setting it for git-ssh.eqiad but not for git-ssh.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/254928 (owner: 10Rush) [22:24:11] (03PS2) 10Rush: git-ssh.wm.o IPv6 reservation [dns] - 10https://gerrit.wikimedia.org/r/254928 [22:24:47] (03CR) 10Dzahn: git-ssh.wm.o IPv6 reservation (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/254928 (owner: 10Rush) [22:25:12] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 91 seconds ago with 0 failures [22:28:43] <_joe_> YuviPanda: no I didn't, and the self-hosted puppetmaster was ok on friday (I built one) [22:34:23] (03CR) 10Dzahn: [C: 031] "it looks like it will work, just the CNAME or not question is maybe something to decide globally and then either change all or none" [dns] - 10https://gerrit.wikimedia.org/r/254928 (owner: 10Rush) [22:34:46] _joe_: ok I'll check what's up with MaxSem [22:34:50] ebernhardson: around? [22:36:14] 6operations, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: wikipedia.es - https://phabricator.wikimedia.org/T101060#1826789 (10Platonides) The name servers still haven't been switched. [22:37:35] YuviPanda: yup [22:37:38] 6operations, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: wikipedia.es - https://phabricator.wikimedia.org/T101060#1826796 (10Dzahn) [22:37:52] ebernhardson: thoughts on nobelium? :) [22:37:57] 6operations, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: wikipedia.es - https://phabricator.wikimedia.org/T101060#1327626 (10Dzahn) adding Jacob Rogers (legal) to some domain related tickets [22:38:03] YuviPanda: well, i guess choose a few more wiki's to turn off and try again? [22:38:55] YuviPanda: not sure which to turn off, last time we did all but english and german and it still had issues [22:39:37] i guess we could go the other way around, only turn on english and german? 
[22:39:48] ebernhardson: yeah [22:40:02] ebernhardson: and if that fails just turn on english [22:40:02] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1801065 (10Dzahn) adding Jacob Rogers (legal) to some domain related tickets [22:40:17] 6operations, 10Wikimedia-DNS, 7domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#1709675 (10Dzahn) adding Jacob Rogers (legal) to some domain related tickets [22:40:52] (03PS1) 10EBernhardson: Enable labs ES replica for english and german [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255033 [22:41:12] (03CR) 10jenkins-bot: [V: 04-1] Enable labs ES replica for english and german [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255033 (owner: 10EBernhardson) [22:41:22] :S [22:44:06] 6operations, 10Deployment-Systems, 6Performance-Team, 7HHVM: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1826816 (10thcipriani) Started talking about this at the deployment [[https://www.mediawiki.org/wiki/Deployment_tooling/Cabal/2015-11-23#etcd... [22:44:20] 6operations, 10Deployment-Systems, 6Performance-Team, 7HHVM, 3Scap3: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1826818 (10thcipriani) [22:45:34] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [22:45:42] PROBLEM - Host lutetium is DOWN: PING CRITICAL - Packet loss = 100% [22:45:56] (03PS2) 10EBernhardson: Enable labs ES replica for english and german [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255033 [22:45:57] uh [22:46:04] is this the power switch again? [22:47:13] 22:46 < ebernhard> anyone mind if i take the 3-4 (starts in 15min) deploy window to try again at enabling the ES labs replica writes? [22:47:17] sure thing [22:47:20] ebernhardson: ^ :) [22:47:25] woo [22:47:30] greg-g: thanks [22:47:32] PROBLEM - Host americium is DOWN: PING CRITICAL - Packet loss = 100% [22:47:35] uh [22:47:38] PROBLEM - Host db1008 is DOWN: PING CRITICAL - Packet loss = 100% [22:47:43] robh: ping: re frack [22:47:44] PROBLEM - Host pay-lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:47:59] oh nvm [22:48:02] it's taken care of [22:48:11] heh, my phone just exploded! 
[22:49:04] PROBLEM - Host payments1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:49:11] PROBLEM - Host payments1003 is DOWN: PING CRITICAL - Packet loss = 100% [22:49:49] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [22:49:56] PROBLEM - Host silicon is DOWN: PING CRITICAL - Packet loss = 100% [22:50:14] RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 6.36 ms [22:50:19] RECOVERY - Host db1008 is UP: PING OK - Packet loss = 0%, RTA = 3.93 ms [22:50:26] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 5.27 ms [22:50:34] RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 2.26 ms [22:50:40] RECOVERY - Host payments1001 is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [22:50:47] RECOVERY - Host americium is UP: PING OK - Packet loss = 0%, RTA = 5.30 ms [22:50:53] RECOVERY - Host payments1003 is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [22:51:02] RECOVERY - Host silicon is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [22:51:38] (03PS3) 10EBernhardson: Enable labs ES replica for english and german [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255033 (https://phabricator.wikimedia.org/T109715) [22:53:49] RECOVERY - Host pay-lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [22:55:09] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 6 failures [22:55:09] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 72 failures [22:55:10] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1826868 (10BBlack) >>! In T101048#1826734, @Platonides wrote: > Is there any technical reason not to have the servers using 700 certificates, using SNI for... 
[23:00:08] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 51 failures
[23:00:09] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 36 failures
[23:00:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 49 failures
[23:00:10] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 72 failures
[23:00:10] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 37 failures
[23:00:10] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 47 failures
[23:00:10] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 60 failures
[23:00:10] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[23:02:54] (03CR) 10EBernhardson: [C: 032] Enable labs ES replica for english and german [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255033 (https://phabricator.wikimedia.org/T109715) (owner: 10EBernhardson)
[23:04:17] (03Merged) 10jenkins-bot: Enable labs ES replica for english and german [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255033 (https://phabricator.wikimedia.org/T109715) (owner: 10EBernhardson)
[23:05:09] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 60 failures
[23:05:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 49 failures
[23:05:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 55 failures
[23:05:09] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 36 failures
[23:05:09] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 37 failures
[23:05:10] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 51 failures
[23:05:10] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 72 failures
[23:05:10] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 42 failures
[23:05:11] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[23:05:11] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: puppet fail
[23:05:12] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 47 failures
[23:08:10] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Enable elasticsearch labs replica for enwiki and dewiki (duration: 00m 28s)
[23:08:12] YuviPanda: ^^
[23:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:08:33] 6operations, 7Monitoring: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456#1826946 (10Dzahn) 3NEW
[23:08:58] 6operations, 7Monitoring: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456#1826953 (10Dzahn) a:3chasemp
[23:09:06] 6operations, 7Monitoring: "ensure legal html" footer monitoring turned CRIT - https://phabricator.wikimedia.org/T119456#1826946 (10Dzahn) p:5Triage>3Normal
[23:09:29] ACKNOWLEDGEMENT - Ensure legal html en.m.wp on en.m.wikipedia.org is CRITICAL: <a href=//wikimediafoundation.org/wiki/Privacy_policy>Privacy</a> html not found daniel_zahn https://phabricator.wikimedia.org/T119456
[23:09:29] ACKNOWLEDGEMENT - Ensure legal html en.wb on en.wikibooks.org is CRITICAL: <a href=//wikimediafoundation.org/wiki/Privacy_policy>Privacy policy</a> html not found daniel_zahn https://phabricator.wikimedia.org/T119456
[23:09:29] ACKNOWLEDGEMENT - Ensure legal html en.wp on en.wikipedia.org is CRITICAL: <a href=//wikimediafoundation.org/wiki/Privacy_policy>Privacy policy</a> html not found daniel_zahn https://phabricator.wikimedia.org/T119456
[23:10:08] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[23:10:08] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 60 failures
[23:10:09] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[23:10:09] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 36 failures
[23:10:10] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 37 failures
[23:10:10] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 189 seconds ago with 0 failures
[23:10:10] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 49 failures
[23:10:10] RECOVERY - check_puppetrun on mintaka is OK: OK: Puppet is currently enabled, last run 150 seconds ago with 0 failures
[23:10:10] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[23:10:11] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 42 failures
[23:10:12] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: puppet fail
[23:11:07] YuviPanda: not sure, but will wait and see. nobelium is already reporting disk write in the 30MB/s range which is where we maxed out before
[23:13:45] 6operations, 6Reading-Admin: Improve UX Strategic Test - https://phabricator.wikimedia.org/T117826#1826975 (10dr0ptp4kt) Here may be a guide for getting this in front of the right people. https://meta.wikimedia.org/wiki/User:Halfak_%28WMF%29/Wikipedian_recruitment
[23:15:08] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 60 failures
[23:15:09] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 36 failures
[23:15:09] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 49 failures
[23:15:09] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 42 failures
[23:15:09] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 37 failures
[23:15:10] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[23:15:10] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 142 seconds ago with 0 failures
[23:18:14] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1826987 (10Milimetric) @jcrespo: I think m4-master holds on to data for too long. Since everything is replicated to analytics-store, I think we can shorten the amount of time data lives on...
[23:20:08] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 60 failures
[23:20:11] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 49 failures
[23:20:11] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 37 failures
[23:20:11] RECOVERY - check_puppetrun on saiph is OK: OK: Puppet is currently enabled, last run 294 seconds ago with 0 failures
[23:20:11] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 36 failures
[23:20:11] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 206 seconds ago with 0 failures
[23:23:15] YuviPanda: actually it seems to have only spiked briefly, nobelium has been hanging out somewhere around 13MB/s disk write and ~250 iops
[23:25:08] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 164 seconds ago with 0 failures
[23:25:09] RECOVERY - check_puppetrun on alnitak is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[23:25:09] RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 255 seconds ago with 0 failures
[23:25:10] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[23:30:54] (03PS1) 10Rush: phab: strip spaces around addresses [puppet] - 10https://gerrit.wikimedia.org/r/255043
[23:31:07] (03PS2) 10Rush: phab: strip spaces around addresses [puppet] - 10https://gerrit.wikimedia.org/r/255043
[23:33:02] (03CR) 10Rush: [C: 032] phab: strip spaces around addresses [puppet] - 10https://gerrit.wikimedia.org/r/255043 (owner: 10Rush)
[23:35:33] (03PS1) 10Andrew Bogott: Rename holmium to labservices1002. [dns] - 10https://gerrit.wikimedia.org/r/255047 (https://phabricator.wikimedia.org/T106303)
[23:39:48] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1827033 (10chasemp) 5Open>3Resolved hopefully resolved
[23:39:55] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1827036 (10chasemp)
[23:43:46] 6operations, 10Beta-Cluster-Infrastructure, 10netops, 7Database: Evaluate security concerns of logging beta cluster db queries on tendril - https://phabricator.wikimedia.org/T119461#1827042 (10jcrespo) 3NEW
[23:45:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 2 failures
[23:50:08] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 2 failures
[23:50:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 2 failures
[23:51:16] (03CR) 10Rush: [C: 032] git-ssh.wm.o IPv6 reservation [dns] - 10https://gerrit.wikimedia.org/r/254928 (owner: 10Rush)
[23:55:08] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 2 failures
[23:55:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 2 failures
[23:55:45] !log maxsem@tin Synchronized portals: (no message) (duration: 00m 28s)
[23:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:58:15] chasemp: how are you with apt pinning? Do you have a minute to help me understand something?
[23:58:35] I understand the general idea, what are you looking at?
[23:58:45] if you log into holmium and look in /etc/apt...
[23:58:59] I want apt to prefer packages from the ubuntucloud repo over things from the wikimedia repo
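[Editor's note: the pinning being discussed above would normally be expressed as a snippet under /etc/apt/preferences.d/ on the host. A minimal sketch follows; the filename and the "o=Canonical" origin string are illustrative assumptions rather than holmium's actual configuration, and the real origin/label values for the ubuntucloud repo should be confirmed with `apt-cache policy` before relying on them.]

    # /etc/apt/preferences.d/ubuntucloud  -- hypothetical filename
    # Give packages from the Ubuntu Cloud Archive a pin priority above 1000 so
    # apt prefers that origin over the same package names in other repos; a
    # priority above 1000 even allows downgrades to the pinned origin's version.
    Package: *
    Pin: release o=Canonical
    Pin-Priority: 1001

    # All other sources, including the wikimedia repo, keep apt's default
    # priority of 500, so they still win for packages the cloud archive
    # does not ship at all.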