[00:00:39] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [00:01:08] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [00:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:09:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:28:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [00:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [00:31:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [00:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:39:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:58:49] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [01:01:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [01:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:09:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:27:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [01:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [01:31:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [01:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:39:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:41:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:46:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [01:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [02:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [02:01:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [02:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:09:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:18:49] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:27:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [02:31:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [02:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:39:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [03:00:37] 06Operations, 06Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3174020 (10GWicke) [03:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [03:01:00] 06Operations, 06Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3174034 (10GWicke) [03:01:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [03:05:41] 06Operations, 06Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3174047 (10GWicke) [03:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:09:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:18:49] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:22:53] 06Operations, 06Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3174055 (10GWicke) [03:28:49] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [03:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [03:31:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [03:34:59] (03CR) 10MZMcBride: "It might be nice to do this in Python 3. Python 2.7 is getting pretty old." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640) (owner: 10Krinkle) [03:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:39:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:57:56] (03CR) 10Dereckson: [C: 04-1] "This doesn't match the current workflow: the script should start from SVG version and resize it accordingly in 1x 1.5x 2x." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640) (owner: 10Krinkle) [03:58:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [04:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [04:02:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [04:06:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:06:39] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:09:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:11:08] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:11:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [04:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:27:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [04:30:39] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [04:31:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [04:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:38:08] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:39:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:53:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [04:53:52] !log started `mwscriptwikiset refreshLinks.php small.dblist` on terbium [04:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [04:58:49] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:00:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [05:01:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [05:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:09:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:28:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [05:29:39] could someone add me to the Triagers group in Phab? [05:29:41] https://phabricator.wikimedia.org/project/profile/13/ [05:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [05:31:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:36:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:39:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:48:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:56:37] !log Deploy alter table on db2108 codfw master (s3, image table) - T160415 [05:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:45] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [05:57:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:58:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347805 (https://phabricator.wikimedia.org/T17441) [06:00:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347805 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [06:00:39] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:01:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347805 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [06:01:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347805 (https://phabricator.wikimedia.org/T17441) (owner: 10Marostegui) [06:02:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:04:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 (duration: 02m 00s) [06:04:18] !log Deploy schema change on s6 - db1093 - T17441 [06:04:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:28] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [06:06:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:09:18] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:13:06] !log Deploy alter table on db1075 eqiad master (s3, image table) - T160415 [06:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:14] T160415: Review schema changes for T125071 - Add index to image table on all wikis - https://phabricator.wikimedia.org/T160415 [06:18:48] PROBLEM - puppet last run on tempdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:20:01] <_joe_> !log killing badly-started puppet agents on mc1010, tempdb2001,db1090, db2058, hydrogen, possibly others later [06:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:27] 06Operations, 10ops-eqdfw, 10Analytics, 06DC-Ops: SATA errors for stat1004 in the dmesg - https://phabricator.wikimedia.org/T162770#3174169 (10elukey) [06:22:18] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:24:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347808 [06:25:02] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347808 [06:28:09] (03PS1) 10Elukey: Set the correct partman recipe for mw2246 and mw2152 [puppet] - 10https://gerrit.wikimedia.org/r/347809 [06:28:14] <_joe_> !log killing long-running puppet-agent on db2058 too 
[06:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:48] RECOVERY - puppet last run on tempdb2001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:29:45] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347808 (owner: 10Marostegui) [06:30:19] thanks _joe_ for fixing those, I was taking a look at db1090 when I saw it on icinga but you were already on it [06:30:32] (03CR) 10Elukey: [C: 032] Set the correct partman recipe for mw2246 and mw2152 [puppet] - 10https://gerrit.wikimedia.org/r/347809 (owner: 10Elukey) [06:30:38] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:31:36] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347808 (owner: 10Marostegui) [06:31:45] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347808 (owner: 10Marostegui) [06:32:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T132416 (duration: 00m 46s) [06:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:56] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [06:36:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347810 [06:36:14] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347810 [06:37:10] !log reimage mw2246.codfw.wmnet mw2152.codfw.wmnet to remove the /tmp partition (codfw videoscalers, switchover prep) [06:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:14] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1093" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/347810 (owner: 10Marostegui) [06:40:25] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347810 (owner: 10Marostegui) [06:40:40] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347810 (owner: 10Marostegui) [06:41:28] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1129.70 Read Requests/Sec=315.20 Write Requests/Sec=2.70 KBytes Read/Sec=39913.20 KBytes_Written/Sec=986.00 [06:42:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 - T17441 (duration: 00m 46s) [06:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:09] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [06:50:28] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=58.50 Read Requests/Sec=254.10 Write Requests/Sec=3.00 KBytes Read/Sec=5379.20 KBytes_Written/Sec=235.60 [06:50:41] <_joe_> !log testing switchover codfw => eqiad, no destructive actions will be taken [06:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:33] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Stop MediaWiki maintenance in the old master DC [06:53:35] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Failed to execute [06:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:53] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t02_start_mediawiki_readonly(codfw, eqiad) Set MediaWiki in read-only mode (db_from config already merged and git pulled) [06:56:54] !log switchdc (oblivian@sarin) 
MediaWiki read-only period starts at: 2017-04-12 06:56:53.822926 [06:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:18] !log root@tin Synchronized wmf-config/db-codfw.php: Set MediaWiki in read-only mode in datacenter codfw (duration: 00m 23s) [06:57:18] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t02_start_mediawiki_readonly(codfw, eqiad) Failed to execute [06:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:38] <_joe_> !log the last messages are just a test and nothing was really done, as codfw is already in read-only mode right now [06:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:16] (03PS1) 10Elukey: Set Trusty as PXE image for mw2152 (videoscaler) [puppet] - 10https://gerrit.wikimedia.org/r/347811 [07:13:26] yep I just reimaged a videoscaler to Debian [07:14:07] hopefully this will be the last reimage [07:15:32] (03CR) 10Elukey: [C: 032] Set Trusty as PXE image for mw2152 (videoscaler) [puppet] - 10https://gerrit.wikimedia.org/r/347811 (owner: 10Elukey) [07:18:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave] [07:22:25] Telia announced maintenance --^ [07:36:09] (03CR) 10Hashar: [C: 031] "That will be fine assuming nova-compute never forks." 
[puppet] - 10https://gerrit.wikimedia.org/r/347688 (https://phabricator.wikimedia.org/T162640) (owner: 10Andrew Bogott) [07:42:28] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [07:50:06] 06Operations, 10media-storage, 15User-fgiunchedi: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3174252 (10fgiunchedi) [07:55:11] <_joe_> !log resuming non-dry run tests of switchdc, all logs from switchdc by me are just tests [07:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:55] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t05_switch_traffic(codfw, eqiad) Switch traffic flow to the appservers in the new datacenter [07:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:56] (03CR) 10Elukey: [C: 032] Add JVM options tunables for Yarn RM and Hadoop DN/NN [puppet/cdh] - 10https://gerrit.wikimedia.org/r/347353 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey) [07:57:29] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3174291 (10fgiunchedi) p:05Triage>03High I'm triaging as high since there's potential for an outage. Did the block or rate li... 
[07:58:18] 06Operations, 06Services (blocked): Set up grafana alerting for services - https://phabricator.wikimedia.org/T162765#3174293 (10fgiunchedi) p:05Triage>03Normal [07:58:28] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t05_switch_traffic(codfw, eqiad) Successfully completed [07:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:34] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3174294 (10fgiunchedi) p:05Triage>03Normal [07:58:45] 06Operations, 10media-storage, 15User-fgiunchedi: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3174295 (10fgiunchedi) p:05Triage>03High [07:59:55] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Restore the TTL of all the MediaWiki discovery records [08:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:02] 06Operations, 10Mail, 10Wikimedia-Mailing-lists: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3174296 (10fgiunchedi) p:05Triage>03Normal [08:00:09] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Successfully completed [08:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:47] 06Operations, 06Performance-Team, 10Wikidata, 10Wikimedia-Site-requests, 07Performance: Increase $wgExpensiveParserFunctionLimit on nowiki - https://phabricator.wikimedia.org/T160685#3174297 (10fgiunchedi) p:05Triage>03Normal [08:01:01] 06Operations, 06Commons, 10Datasets-General-or-Unknown, 07Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229#3174298 (10fgiunchedi) p:05Triage>03Normal [08:02:17] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t05_switch_datacenter(codfw, eqiad) Switch MediaWiki configuration to the new datacenter [08:02:23] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:25] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t05_switch_datacenter(codfw, eqiad) Successfully completed [08:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:10] (03PS1) 10Elukey: netboot.cfg: set correct priority to the videoscaler's recipe [puppet] - 10https://gerrit.wikimedia.org/r/347813 [08:03:25] yes yes it is becoming ridiculous I know, feel free to mock me :D [08:05:28] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Make maps active / active - https://phabricator.wikimedia.org/T162362#3160389 (10Pnorman) Unless you take special measures two tile servers with the same style and data may render labels differently. Generally this is caused by queries w... [08:05:40] 06Operations, 10DBA, 10MediaWiki-API, 10Traffic: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3174307 (10Marostegui) Looks like he stopped two days ago: https://grafana.wikimedia.org/dashboard/db/api-summary?orgId=1&from=14... 
[08:06:38] PROBLEM - parsoid on wtp2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:11] (03CR) 10Elukey: [C: 032] netboot.cfg: set correct priority to the videoscaler's recipe [puppet] - 10https://gerrit.wikimedia.org/r/347813 (owner: 10Elukey) [08:07:28] RECOVERY - parsoid on wtp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.147 second response time [08:09:34] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t06_redis(codfw, eqiad) Switch the Redis replication [08:09:38] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t06_redis(codfw, eqiad) Successfully completed [08:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:27] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t08_stop_mediawiki_readonly(codfw, eqiad) Set MediaWiki in read-write mode (db_to config already merged and git pulled) [08:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:04] !log root@tin Synchronized wmf-config/db-eqiad.php: Set MediaWiki in read-write mode in datacenter eqiad (duration: 00m 35s) [08:14:05] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t08_stop_mediawiki_readonly(codfw, eqiad) Failed to execute [08:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:43] <_joe_> this ^^ is the 15 minutes vs 3 minutes thing [08:18:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:18:34] (03PS1) 10Elukey: Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 [08:19:42] (03CR) 10jerkins-bot: [V: 04-1] Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 (owner: 10Elukey) 
[08:21:03] argh the : [08:21:30] lol [08:22:48] (03PS2) 10Elukey: Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 [08:22:55] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Start MediaWiki maintenance in the new master DC [08:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:41] (03PS3) 10Elukey: Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 (https://phabricator.wikimedia.org/T159219) [08:24:26] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Successfully completed [08:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:43] (03PS4) 10Elukey: Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 (https://phabricator.wikimedia.org/T159219) [08:30:53] 06Operations, 10media-storage, 15User-fgiunchedi: Running swiftrepl is not puppetized - https://phabricator.wikimedia.org/T162123#3174367 (10fgiunchedi) [08:31:03] 06Operations, 10media-storage, 15User-fgiunchedi: Swiftrepl was stuck in an infinite loop since days - https://phabricator.wikimedia.org/T162122#3174368 (10fgiunchedi) [08:32:33] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Clean-up unused variables [puppet] - 10https://gerrit.wikimedia.org/r/347676 (owner: 10Mobrovac) [08:32:37] (03PS2) 10Alexandros Kosiaris: RESTBase: Clean-up unused variables [puppet] - 10https://gerrit.wikimedia.org/r/347676 (owner: 10Mobrovac) [08:32:40] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] RESTBase: Clean-up unused variables [puppet] - 10https://gerrit.wikimedia.org/r/347676 (owner: 10Mobrovac) [08:32:42] 06Operations, 10Cassandra, 06Services (done): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3174383 (10fgiunchedi) >>! In T162614#3171504, @Eevans wrote: >>>! 
In T162614#3171494, @Eevans wrote: >>>>! In T162614#3170616, @elukey wrote: > > [ ... ] > >>... [08:38:18] godog: o/ [08:39:11] hi! [08:42:35] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3174423 (10akosiaris) >>! In T162462#3172669, @jcrespo wrote: > I am commenting this here, please tell me if completely unrelated and I will create a new ticket: > > db1090 keeps fai... [08:43:49] (03PS5) 10Elukey: Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 (https://phabricator.wikimedia.org/T159219) [08:44:30] !log reimaging elastic2020 for testing - T149006 [08:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:37] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [08:44:41] (03PS6) 10Elukey: Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 (https://phabricator.wikimedia.org/T159219) [08:44:46] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3174429 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2020.codfw.wmnet'... 
[08:49:53] (03PS7) 10Elukey: Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 (https://phabricator.wikimedia.org/T159219) [08:50:59] (03CR) 10Elukey: [V: 032 C: 032] Set Xms settings for Hadoop Workers daemons [puppet] - 10https://gerrit.wikimedia.org/r/347814 (https://phabricator.wikimedia.org/T159219) (owner: 10Elukey) [08:51:04] *twiddles thumbs* [08:51:28] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 1660 MB (3% inode=85%) [08:51:28] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 93575.02 seconds [08:51:36] <_joe_> sigh [08:51:41] <_joe_> ocg, again? [08:51:41] That is because of an alter, it came back from downtime [08:51:43] I will silence it [08:52:15] !log upgrade cache_upload to linux 4.9 T162029 [08:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:22] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [08:53:00] _joe_ maybe logs? Checking [08:53:08] <_joe_> elukey: no [08:53:35] <_joe_> elukey: do the following: ssh ocg1003 and do df -h [08:53:41] <_joe_> then do the same on ocg1002 [08:54:18] elukey@ocg1003:/srv/deployment/ocg/output$ du -hs [08:54:19] 26G . [08:55:05] * elukey just followed Joe's advice and cries in a corner [08:55:22] <_joe_> you know what's the funniest thing? 
[08:55:27] <_joe_> it's hard to fix this now [08:58:37] !log addshore@tin Synchronized php-1.29.0-wmf.19/extensions/WikimediaEvents/WikimediaEventsHooks.php: [[gerrit:347815|patch1]] & [[gerrit:347774|patch2]] WMDE Spring campaign PT1/2 (duration: 00m 47s) [08:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:41] !log addshore@tin Synchronized php-1.29.0-wmf.19/extensions/WikimediaEvents/extension.json: [[gerrit:347815|patch1]] & [[gerrit:347774|patch2]] WMDE Spring campaign PT2/2 (duration: 00m 45s) [08:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:45] 06Operations: ogc1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780#3174448 (10Joe) [09:05:47] !log Restarting Jenkins for plugins update [09:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:59] 06Operations: ogc1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780#3174460 (10Joe) p:05Triage>03High a:03Joe [09:06:16] <_joe_> !log creating a LVM volume on ocg1003 [09:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:47] hashar: i was waiting on the beta scap eqiad job! ;) [09:06:48] hehe [09:06:55] ah sorry :(( [09:07:09] no worries [09:07:11] :) [09:07:19] (i was j/k) [09:07:31] phuedx: you can rebuild it via https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ \O/ [09:07:55] oh that was fast [09:07:56] neato [09:08:00] kicked off a build [09:08:12] yeah Jenkins is quite fast to restart nowadays [09:08:51] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3174462 (10akosiaris) >>! In T162462#3174423, @akosiaris wrote: >>>! In T162462#3172669, @jcrespo wrote: >> I am commenting this here, please tell me if completely unrelated and I wil... 
[09:10:56] !log Restarting Jenkins for plugins update (2) [09:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:38] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [09:11:48] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time [09:11:58] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [09:12:20] <_joe_> !log copying data from / to the neww partition on ocg1003 T162462 [09:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:27] T162462: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462 [09:12:38] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time [09:12:45] (03PS1) 10Volans: Switchdc: rename redis stage from t05 to t06 [puppet] - 10https://gerrit.wikimedia.org/r/347816 (https://phabricator.wikimedia.org/T160178) [09:12:48] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.036 second response time [09:12:58] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 75218 bytes in 0.165 second response time [09:16:28] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:17:18] <_joe_> mw1265 is one of the hosts on 3.18 right? 
[09:17:34] <_joe_> moritzm: might be worth looking into ^^ [09:18:36] 06Operations, 10Monitoring, 07LDAP: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3174465 (10Peachey88) [09:19:52] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, but maybe using a variable for that in the first place would have helped :P" [puppet] - 10https://gerrit.wikimedia.org/r/347816 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:20:08] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:20:41] (03PS1) 10Addshore: WMDE Spring campaign - Add logging from WikimediaEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347817 [09:21:03] (03PS1) 10Addshore: WMDE Spring campaign - Remove logging (no longer needed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347818 [09:21:24] (03CR) 10Addshore: [C: 04-2] "To be deployed after the 22nd April 2017" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347818 (owner: 10Addshore) [09:22:28] (03CR) 10Addshore: [C: 032] WMDE Spring campaign - Add logging from WikimediaEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347817 (owner: 10Addshore) [09:22:58] !log Restarting Jenkins for Matrix related plugins updates (3) [09:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:23] 06Operations, 10DBA, 10Monitoring, 13Patch-For-Review: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#3174470 (10fgiunchedi) >>! In T151999#2856939, @akosiaris wrote: > So, somehow we missed the error there.... 
[09:23:25] (03Merged) 10jenkins-bot: WMDE Spring campaign - Add logging from WikimediaEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347817 (owner: 10Addshore) [09:23:38] (03CR) 10jenkins-bot: WMDE Spring campaign - Add logging from WikimediaEvents [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347817 (owner: 10Addshore) [09:23:44] (03PS2) 10Volans: Switchdc: rename redis stage from t05 to t06 [puppet] - 10https://gerrit.wikimedia.org/r/347816 (https://phabricator.wikimedia.org/T160178) [09:24:13] (03CR) 10jerkins-bot: [V: 04-1] Switchdc: rename redis stage from t05 to t06 [puppet] - 10https://gerrit.wikimedia.org/r/347816 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:24:21] volans: I will recheck your change [09:24:46] thanks hashar [09:25:32] so much jobs going on that it is hard to restart jenkins gracefully :( [09:25:40] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/347816 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [09:25:46] hashar: no hurry for mine [09:25:48] can wait [09:25:53] feel free to abort it [09:25:58] if that helps [09:26:08] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:347817|WMDE Spring campaign - Add logging from WikimediaEvent]] (duration: 00m 46s) [09:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:38] https://phabricator.wikimedia.org/T162780 interesting, title says ogc and the desc says ocg [09:28:46] (just a side note) [09:29:17] 06Operations: ocg1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780#3174473 (10Volans) [09:29:22] thanks revi, fixed [09:29:32] :D [09:31:08] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:35:00] (03PS1) 10Addshore: wmgUseGettingStarted false for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347819 [09:36:05] (03PS1) 10Addshore: wmgUseGettingStarted true for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 [09:36:15] (03CR) 10Addshore: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347820 (owner: 10Addshore) [09:38:08] (03CR) 10Addshore: [C: 032] wmgUseGettingStarted false for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347819 (owner: 10Addshore) [09:39:24] (03Merged) 10jenkins-bot: wmgUseGettingStarted false for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347819 (owner: 10Addshore) [09:39:33] (03CR) 10jenkins-bot: wmgUseGettingStarted false for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347819 (owner: 10Addshore) [09:41:28] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:347819|wmgUseGettingStarted false for dewiki]] (duration: 00m 45s) [09:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:19] !log starting load on elastic2020 - T149006 [09:42:21] (03PS1) 10Joal: Update pivot config template file [puppet] - 10https://gerrit.wikimedia.org/r/347821 [09:42:24] elukey: --^ [09:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:26] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [09:43:01] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3174556 (10Joe) >>! In T162462#3174462, @akosiaris wrote: > It has stopped happening since those last lines I 've pasted above (something by cron? logrotate?). I 'll keep an eye for i... 
[09:44:08] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [09:44:38] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp1071_v4, cp1071_v6 [09:44:38] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 53 not-conn: cp1071_v6 [09:44:38] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 53 not-conn: cp1071_v6 [09:45:18] that's me ^ [09:45:38] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [09:45:38] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [09:45:38] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [09:46:39] 06Operations, 06Labs: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3174557 (10akosiaris) >>! In T162462#3174556, @Joe wrote: >>>! In T162462#3174462, @akosiaris wrote: >> It has stopped happening since those last lines I 've pasted above (something b... [09:47:01] <_joe_> !log remounting the new partition under /srv/deployment/ocg/output, cleaning out the old dir. Will cause a service interruption for requests to ocg1003 for a few minutes. 
T162780 [09:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:08] T162780: ocg1003 partitions are severely misconfigured - https://phabricator.wikimedia.org/T162780 [09:47:27] (03PS2) 10Elukey: Update pivot config template file [puppet] - 10https://gerrit.wikimedia.org/r/347821 (owner: 10Joal) [09:47:51] (03CR) 10Elukey: [V: 032 C: 032] Update pivot config template file [puppet] - 10https://gerrit.wikimedia.org/r/347821 (owner: 10Joal) [09:47:58] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: connection error: HTTPConnectionPool(host=localhost, port=8000): Max retries exceeded with url: /?command=health (Caused by class socket.error: [Errno 111] Connection refused) [09:49:08] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [09:50:05] 06Operations, 15User-fgiunchedi: Decomission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785#3174562 (10fgiunchedi) [09:50:08] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! [09:51:29] <_joe_> that should now recover [09:51:32] <_joe_> it's working now [09:52:08] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [09:52:39] !log swift codfw-prod: ms-be2001 - ms-be2012 initial decom - T162785 [09:52:43] 06Operations, 13Patch-For-Review, 15User-Elukey: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3174579 (10Marostegui) [09:52:45] 06Operations, 10DBA, 06Labs: fstrim: Operation not supported on Labs DBs - https://phabricator.wikimedia.org/T151746#3174577 (10Marostegui) 05Open>03Resolved Closing this as that cronjob isn't there anymore on those labs hosts (which will be decommissioned soon) and it is not present on the new ones either. 
[09:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:46] T162785: Decomission ms-be2001 - ms-be2012 - https://phabricator.wikimedia.org/T162785 [09:52:53] !log testing t03 and t07 DB-RO/RW stages of switchdc (codfw->eqiad), we are already in that situation, t03 will fail the verfication, is expected [09:52:54] <_joe_> !log removing the old directory of data from ocg1003 [09:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:28] RECOVERY - Disk space on ocg1003 is OK: DISK OK [09:56:00] !log switchdc (volans@neodymium) START TASK - switchdc.stages.t03_coredb_masters_readonly(codfw, eqiad) set core DB masters in read-only mode [09:56:03] !log switchdc (volans@neodymium) END TASK - switchdc.stages.t03_coredb_masters_readonly(codfw, eqiad) Failed to execute [09:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:28] all good, failure was expected [09:56:35] eqiad is ofc not read-only [09:56:50] I don't think it should fail [09:57:13] it's the only way to ensure that we are in read-only, it set RO the dc_from and verify that both are RO [09:57:27] ah [09:57:32] so it is the first step [09:57:52] I thoght it was the other one, then it makes sense [09:57:55] this was the t03: https://github.com/wikimedia/operations-switchdc/blob/master/switchdc/stages/t03_coredb_masters_readonly.py [09:58:04] yes yes [09:58:04] then now I'll try the t07 that will success [09:58:09] *succeed [09:58:12] at least in theory :D [09:58:13] yeah, that is the one I was talkign about [09:58:35] going now [09:58:50] !log switchdc (volans@neodymium) START TASK - switchdc.stages.t07_coredb_masters_readwrite(codfw, eqiad) set core DB masters in read-write mode [09:58:53] !log switchdc (volans@neodymium) END TASK - 
switchdc.stages.t07_coredb_masters_readwrite(codfw, eqiad) Successfully completed [09:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:07] done [09:59:13] <_joe_> volans: tendril? [09:59:15] <_joe_> :) [09:59:23] yes, that was already tested, I'll do that too later [09:59:26] <_joe_> the we'll have "tested" everything [09:59:30] <_joe_> *then [10:00:09] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:02:28] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:02:52] <_joe_> ok I am going to look into the puppet failures [10:03:03] <_joe_> they are caused by conftool errors ^^ [10:07:08] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [10:07:42] !log Upgrading Jenkins "Git client" plugin 2.3.0..2.4.1 and restarting Jenkins [10:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:28] 06Operations, 06Labs, 06Release-Engineering-Team, 07Zuul: Upgrade pbr for zuul - https://phabricator.wikimedia.org/T162787#3174639 (10Paladox) [10:08:43] 06Operations, 06Labs, 06Release-Engineering-Team, 07Zuul: Upgrade pbr for zuul - https://phabricator.wikimedia.org/T162787#3174651 (10Paladox) [10:15:28] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:15:32] (03PS1) 10Alexandros Kosiaris: puppetmaster: Remove the jessie-backports pinning [puppet] - 10https://gerrit.wikimedia.org/r/347825 (https://phabricator.wikimedia.org/T162462) [10:22:01] (03PS2) 10Alexandros Kosiaris: puppetmaster: Remove the jessie-backports pinning [puppet] - 10https://gerrit.wikimedia.org/r/347825 (https://phabricator.wikimedia.org/T162462) [10:26:43] 06Operations, 10DBA, 10Monitoring, 13Patch-For-Review: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#3174702 (10jcrespo) [10:26:44] 06Operations, 10DBA: Puppetize grants for mysql backups on dbstore hosts - https://phabricator.wikimedia.org/T111929#3174703 (10jcrespo) [10:34:07] !log Upgrading Jenkins "Email Extension" plugin 2.57.1..2.57.2 and restarting Jenkins [10:34:10] 06Operations, 10DBA: Create less overhead on bacula jobs - https://phabricator.wikimedia.org/T162789#3174710 (10jcrespo) [10:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:32] 06Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3174710 (10jcrespo) [10:36:47] (03PS1) 10Giuseppe Lavagetto: wmflib: add some information to failures of conftool [puppet] - 10https://gerrit.wikimedia.org/r/347827 [10:37:08] (03PS1) 10Volans: Traffic: format only, noop [puppet] - 10https://gerrit.wikimedia.org/r/347828 (https://phabricator.wikimedia.org/T160178) [10:38:45] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: add some information to failures of conftool [puppet] - 10https://gerrit.wikimedia.org/r/347827 (owner: 10Giuseppe Lavagetto) [10:43:28] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:45:54] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/6134/ says ok" 
[puppet] - 10https://gerrit.wikimedia.org/r/347825 (https://phabricator.wikimedia.org/T162462) (owner: 10Alexandros Kosiaris) [10:46:16] (03PS3) 10Alexandros Kosiaris: puppetmaster: Remove the jessie-backports pinning [puppet] - 10https://gerrit.wikimedia.org/r/347825 (https://phabricator.wikimedia.org/T162462) [10:46:27] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Remove the jessie-backports pinning [puppet] - 10https://gerrit.wikimedia.org/r/347825 (https://phabricator.wikimedia.org/T162462) (owner: 10Alexandros Kosiaris) [10:46:32] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetmaster: Remove the jessie-backports pinning [puppet] - 10https://gerrit.wikimedia.org/r/347825 (https://phabricator.wikimedia.org/T162462) (owner: 10Alexandros Kosiaris) [10:48:19] 06Operations, 06Labs, 13Patch-For-Review: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3174739 (10akosiaris) >>! In T162462#3173260, @Andrew wrote: > Fixing the puppetmaster issue requires changing (well, removing) the pinning in the puppet manifes... 
[10:50:48] PROBLEM - DPKG on labvirt1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:50:49] PROBLEM - DPKG on db2037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:50:49] PROBLEM - DPKG on labvirt1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:50:54] that's ^ me [10:51:28] PROBLEM - DPKG on labvirt1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:51:38] PROBLEM - DPKG on dbstore1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:51:49] RECOVERY - DPKG on labvirt1006 is OK: All packages OK [10:51:49] RECOVERY - DPKG on db2037 is OK: All packages OK [10:51:49] RECOVERY - DPKG on labvirt1008 is OK: All packages OK [10:52:28] PROBLEM - DPKG on labcontrol1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:52:28] PROBLEM - DPKG on labvirt1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:52:28] RECOVERY - DPKG on labvirt1005 is OK: All packages OK [10:53:28] RECOVERY - DPKG on labcontrol1001 is OK: All packages OK [10:53:28] RECOVERY - DPKG on labvirt1002 is OK: All packages OK [10:53:38] RECOVERY - DPKG on dbstore1002 is OK: All packages OK [10:58:08] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:58:35] akosiaris hi, how did you manage to upgrade puppetmaster puppet packages [10:58:38] as i get [10:58:39] The following packages have unmet dependencies: [10:58:39] puppet : Breaks: facter (< 2.4.0~) but 2.2.0-1 is to be installed [10:58:39] Breaks: puppetmaster-common (< 4.4.2-1~) but 3.8.5-2~bpo8+1 is to be installed [10:58:39] puppetmaster-common : D [10:58:48] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:59:28] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:59:28] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:59:28] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:59:48] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:59:59] paladox: want to try again ? I 've just merged https://gerrit.wikimedia.org/r/347825 [11:00:08] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:00:24] ok [11:00:26] paladox: and I did not really have to upgrade anything. We were already at 3.8 and wanted to stay there [11:00:38] Oh [11:00:42] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3174778 (10Marostegui) btw @Cmjohnson db1042 can be decommissioned (it is on b2, so maybe db1099 can take its place?): https://phabricator.wikimedia.org/T149793 [11:00:48] i have puppet 3.7 installed [11:01:07] paladox: then apt-get install puppet should do the right thing now after running puppet once [11:01:12] but 3.7 is fine [11:01:18] ok [11:02:06] !log upgrade puppet across the trusty fleet to 3.8. 
T162462 [11:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:13] T162462: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462 [11:04:26] akosiaris nope still happens [11:04:31] for the puppetmaster packages [11:04:41] and puppet [11:05:14] <_joe_> !log downgrading python-urllib3 on puppetmaster1001 [11:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:57] puppet : Breaks: facter (< 2.4.0~) but 2.2.0-1 is to be installed [11:07:16] i ran puppet agent twice [11:13:11] hmm maybe we don't autoclean the .pref files [11:13:28] oh, wheres the .pref files. [11:13:52] it's /etc/apt/preferences.d/puppet.pref [11:13:54] oh [11:13:56] remove it [11:13:56] ah [11:13:56] Package: puppet puppet-common puppet-el puppetmaster puppetmaster-common puppetmaster-passenger vim-puppet [11:13:57] Pin: release a=jessie-backports [11:13:57] Pin-Priority: 1001 [11:13:58] oh [11:13:59] ok [11:14:03] rm /etc/apt/preferences.d/puppet.pref [11:14:57] that worked [11:15:01] i see no updates now [11:15:07] ok [11:15:20] thanks [11:15:58] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[ensure_debdeploy-puppetmaster-frontend_standard],Exec[ensure_trebuchet_master_tin.eqiad.wmnet] [11:16:28] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [11:17:28] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:17:48] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:17:48] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [11:18:08] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:19:05] 06Operations, 15User-fgiunchedi: Reduce Swift technical debt - https://phabricator.wikimedia.org/T162792#3174790 (10fgiunchedi) [11:19:28] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [11:20:16] 06Operations, 15User-fgiunchedi: Reduce Swift technical debt - https://phabricator.wikimedia.org/T162792#3174803 (10fgiunchedi) [11:20:17] 06Operations, 10media-storage: Consider storage policies for swift - https://phabricator.wikimedia.org/T151648#3174805 (10fgiunchedi) [11:20:20] 06Operations, 10media-storage, 15User-fgiunchedi: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3174804 (10fgiunchedi) [11:20:30] 06Operations, 10media-storage, 15User-fgiunchedi: Swift version and distro upgrade - https://phabricator.wikimedia.org/T162609#3168478 (10fgiunchedi) a:03fgiunchedi [11:21:01] 06Operations, 06Labs, 13Patch-For-Review: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3174807 (10akosiaris) 05Open>03Resolved Note that after merging https://gerrit.wikimedia.org/r/347825 removal of /etc/apt/preferences.d/puppet.pref is requir... 
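Note on the stale pin discussed above: a minimal sketch, in Python, of how one might scan apt preferences text for pins strong enough to force a release. The stanza is the one pasted in the log from /etc/apt/preferences.d/puppet.pref; the function name `find_forcing_pins` and its `threshold` parameter are hypothetical, for illustration only, and are not part of any Wikimedia tooling. Per apt_preferences(5), a Pin-Priority above 1000 causes a version to be installed even if that is a downgrade, which may explain why the unmet-dependency errors persisted until the leftover file was removed.

```python
import re

# Stanza as pasted in the log (contents of /etc/apt/preferences.d/puppet.pref).
PUPPET_PREF = """\
Package: puppet puppet-common puppet-el puppetmaster puppetmaster-common puppetmaster-passenger vim-puppet
Pin: release a=jessie-backports
Pin-Priority: 1001
"""


def find_forcing_pins(text, threshold=1000):
    """Return (packages, pin, priority) for stanzas whose priority exceeds threshold.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    results = []
    # Stanzas in a preferences file are separated by blank lines.
    for stanza in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in stanza.splitlines():
            # Skip comment lines and anything that is not a "Key: value" field.
            if line.lstrip().startswith("#") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        try:
            priority = int(fields.get("Pin-Priority", ""))
        except ValueError:
            # No usable priority in this stanza; ignore it.
            continue
        if priority > threshold:
            packages = fields.get("Package", "").split()
            results.append((packages, fields.get("Pin", ""), priority))
    return results


if __name__ == "__main__":
    for packages, pin, priority in find_forcing_pins(PUPPET_PREF):
        print(f"{pin} (priority {priority}) applies to: {', '.join(packages)}")
```

Running something like this against the files under /etc/apt/preferences.d/ on a host would only report candidates; whether a given pin is actually stale still has to be checked against the current puppet manifests.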
[11:24:34] 06Operations, 15User-fgiunchedi: Rate limit swift operations - https://phabricator.wikimedia.org/T162793#3174813 (10fgiunchedi) [11:25:49] (03PS1) 10Giuseppe Lavagetto: role::puppet_compiler: fix etcd protocol [puppet] - 10https://gerrit.wikimedia.org/r/347832 [11:29:38] (03CR) 10Giuseppe Lavagetto: [C: 032] role::puppet_compiler: fix etcd protocol [puppet] - 10https://gerrit.wikimedia.org/r/347832 (owner: 10Giuseppe Lavagetto) [11:33:16] 06Operations, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3174860 (10fgiunchedi) [11:42:08] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:42:18] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:42:18] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:42:18] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:42:28] PROBLEM - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [11:42:28] PROBLEM - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.178 and port 9042: Connection refused [11:42:31] mutante: re mw2256 and dsh - will take care of it later on today, still reimaging :( [11:42:58] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [11:43:12] mmm [11:43:18] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [11:43:18] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [11:43:18] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [11:44:38] PROBLEM - Check systemd state on restbase2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:44:58] PROBLEM - cassandra-c service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:45:17] ah this seems to be a tombstone issue [11:45:23] urandom: ---^ [11:46:02] from logstash I can see Error: Operation timed out - received only 1 responses. 
[11:46:52] checking cassandra-c [11:48:28] yep, running puppet to bring the instance up again [11:48:58] RECOVERY - cassandra-c service on restbase2007 is OK: OK - cassandra-c is active [11:49:38] RECOVERY - Check systemd state on restbase2007 is OK: OK - running: The system is fully operational [11:50:28] RECOVERY - cassandra-c SSL 10.192.16.178:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-c valid until 2017-09-12 15:35:55 +0000 (expires in 153 days) [11:51:28] RECOVERY - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.178 port 9042 [11:52:15] ah nice now we have tombstone_warn_threshold \o/ [11:56:28] (03PS2) 10Alexandros Kosiaris: Change bacula retention policies and volume number [puppet] - 10https://gerrit.wikimedia.org/r/341817 [11:57:05] !log restart Yarn nodemanager daemons on all the Hadoop worker node to pick up the new JVM settings [11:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:34] (03PS1) 10Giuseppe Lavagetto: profile::conftool::client: accept protocol as a parameter [puppet] - 10https://gerrit.wikimedia.org/r/347834 [12:01:43] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::conftool::client: accept protocol as a parameter [puppet] - 10https://gerrit.wikimedia.org/r/347834 (owner: 10Giuseppe Lavagetto) [12:09:58] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:10:08] PROBLEM - SSH on ms-be1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:18] PROBLEM - puppet last run on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:10:28] PROBLEM - swift-container-replicator on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:10:28] PROBLEM - swift-object-replicator on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:10:28] PROBLEM - MD RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:10:38] PROBLEM - Disk space on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:10:58] PROBLEM - dhclient process on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:10:58] RECOVERY - SSH on ms-be1039 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [12:11:08] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures [12:11:08] PROBLEM - puppet last run on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:18] RECOVERY - swift-container-replicator on ms-be1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:11:18] PROBLEM - Check whether ferm is active by checking the default input chain on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:18] RECOVERY - swift-object-replicator on ms-be1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:11:18] PROBLEM - Check size of conntrack table on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:18] PROBLEM - DPKG on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:18] PROBLEM - HHVM jobrunner on mw2246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:11:18] PROBLEM - configured eth on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:19] PROBLEM - Disk space on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:19] PROBLEM - nutcracker port on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:20] PROBLEM - nutcracker process on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:20] PROBLEM - salt-minion processes on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:11:21] RECOVERY - MD RAID on ms-be1039 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [12:11:28] PROBLEM - MD RAID on mw2246 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:28] RECOVERY - Disk space on ms-be1039 is OK: DISK OK [12:13:05] (03CR) 10Alexandros Kosiaris: [C: 032] Change bacula retention policies and volume number [puppet] - 10https://gerrit.wikimedia.org/r/341817 (owner: 10Alexandros Kosiaris) [12:13:08] (03PS3) 10Alexandros Kosiaris: Change bacula retention policies and volume number [puppet] - 10https://gerrit.wikimedia.org/r/341817 [12:13:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Change bacula retention policies and volume number [puppet] - 10https://gerrit.wikimedia.org/r/341817 (owner: 10Alexandros Kosiaris) [12:14:13] !log kartik@tin Started deploy [cxserver/deploy@2842efa]: Update cxserver to 56a012d [12:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:28] ms-be1039 was me btw [12:18:11] !log kartik@tin Finished deploy [cxserver/deploy@2842efa]: Update cxserver to 56a012d (duration: 03m 58s) [12:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:13] mw2246 is mine, downtime expired for reimage.. [12:20:15] * elukey cries [12:20:50] it was stuck in wmf-reimage due to the partman recipe [12:20:58] pxe boot manually etc.. 
[12:21:09] at this point I'll wait a bit that puppet finish and it should recover [12:23:00] (03PS1) 10Alexandros Kosiaris: Move backup::openldapset into the respective roles [puppet] - 10https://gerrit.wikimedia.org/r/347836 [12:23:03] (03PS1) 10Alexandros Kosiaris: backup::set: Allow passing an $extras config hash [puppet] - 10https://gerrit.wikimedia.org/r/347837 [12:23:36] !log restart HDFS datanode daemons on all the Hadoop worker node to pick up the new JVM settings [12:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:08] PROBLEM - NTP on mw2246 is CRITICAL: NTP CRITICAL: No response from NTP server [12:32:08] RECOVERY - salt-minion processes on mw2246 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:32:09] RECOVERY - Check whether ferm is active by checking the default input chain on mw2246 is OK: OK ferm input default policy is set [12:32:09] RECOVERY - Disk space on mw2246 is OK: DISK OK [12:32:09] RECOVERY - Check size of conntrack table on mw2246 is OK: OK: nf_conntrack is 0 % full [12:32:09] RECOVERY - configured eth on mw2246 is OK: OK - interfaces up [12:32:34] there you go [12:32:42] and this time Trusty with no /tmp partition [12:32:47] RECOVERY - dhclient process on mw2246 is OK: PROCS OK: 0 processes with command name dhclient [12:33:07] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:34:07] RECOVERY - DPKG on mw2246 is OK: All packages OK [12:36:50] jouncebot: refresh [12:36:55] I refreshed my knowledge about deployments. 
[12:36:56] jouncebot: next [12:36:56] In 0 hour(s) and 23 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T1300) [12:37:07] ah no patches great [12:37:14] I am commuting back to office [12:40:07] RECOVERY - HHVM jobrunner on mw2246 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.078 second response time [12:41:45] (03PS4) 10Hoo man: Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) [12:41:46] (03PS1) 10Hoo man: Allow running two dumpwikidatattl dumps side by side [puppet] - 10https://gerrit.wikimedia.org/r/347838 (https://phabricator.wikimedia.org/T155103) [12:41:57] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [12:42:07] RECOVERY - nutcracker port on mw2246 is OK: TCP OK - 0.000 second response time on port 11212 [12:42:07] RECOVERY - nutcracker process on mw2246 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [12:47:27] PROBLEM - mediawiki-installation DSH group on mw2152 is CRITICAL: Host mw2152 is not in mediawiki-installation dsh group [12:47:53] fixing --^ [12:53:05] 06Operations, 10Cassandra, 06Services (done): RAID-0 volume not mounted on restbase-dev1001.eqiad.wmnet - https://phabricator.wikimedia.org/T162614#3175069 (10Eevans) 05Open>03Resolved >>! In T162614#3174383, @fgiunchedi wrote: > Yes I can confirm the raid setup (both for `/` and `/srv`) is handled by pa... [12:53:50] jouncebot: next [12:53:51] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T1300) [12:54:27] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[12:56:09] 06Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3175078 (10fgiunchedi) For context: fixing this would also alleviate a current problem where long-running backup jobs stall both other backup jobs and restore jobs as well (e.g... [12:56:43] (03PS2) 10Ema: cache: noop to test the switchdc procedures [puppet] - 10https://gerrit.wikimedia.org/r/347828 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:57:01] (03CR) 10Ema: [C: 031] cache: noop to test the switchdc procedures [puppet] - 10https://gerrit.wikimedia.org/r/347828 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [12:59:07] RECOVERY - NTP on mw2246 is OK: NTP OK: Offset -0.0001857280731 secs [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T1300). Please do the needful. [13:00:34] ah [13:00:38] hoo: I am around [13:00:51] I have 1 thing to add to the window too! [13:01:51] You probably want to start with addshore then [13:02:04] (03PS2) 10Hashar: Revert "Temporarily enable change dispatch logging on testwikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346788 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [13:02:09] I just added mine to the cal [13:02:10] ah [13:02:15] I rebased hoo change [13:02:18] hashar: If you would like I can do mine! [13:02:20] so I guess we can do that one first? 
[13:02:30] Yeah, the logging revert can be done at any time [13:02:34] trivial thing [13:02:57] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346788 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [13:02:59] addshore: sure [13:03:19] o/ [13:03:28] hashar: I'll +2 it now for CI to run, dont be surprised if you fetch it too, but it is in an ext so wont get in your way [13:03:53] hashar, addshore: looks like you are in charge for swat, ping me if you need me :) [13:04:06] addshore: yeah that is what I do usually. Mass +2 anything that is slow to merge [13:04:08] (03Merged) 10jenkins-bot: Revert "Temporarily enable change dispatch logging on testwikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346788 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [13:04:22] (03CR) 10jenkins-bot: Revert "Temporarily enable change dispatch logging on testwikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346788 (https://phabricator.wikimedia.org/T159828) (owner: 10Hoo man) [13:04:43] hoo: should we test the change on mwdebug1001 ? 
[13:05:46] (03PS15) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [13:06:14] hashar: No need, this should just work [13:06:25] and if it doesn't, we will notice shortly [13:06:31] but having a few logs to many wont hurt [13:06:32] * hashar wonders why "scap pull" takes so long [13:07:18] hashar: FYI I am ready to do the 2 syncs needed for my patch [13:07:35] hashar: i think we are on scap3 now fyi [13:07:50] syncing hoo change [13:08:30] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Revert "Temporarily enable change dispatch logging on testwikidata" - T159828 (duration: 00m 47s) [13:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:38] T159828: Use redis-based lock manager for dispatchChanges on test.wikidata.org - https://phabricator.wikimedia.org/T159828 [13:08:40] addshore: it is all your! [13:08:53] doing [13:09:04] hoo: also you listed "Update Wikidata: ArticlePlaceholder" but there is no gerrit changes linked to it [13:09:10] hashar: On that [13:09:49] !log addshore@tin Synchronized php-1.29.0-wmf.20/extensions/WikimediaEvents/WikimediaEventsHooks.php: [[gerrit:347773|WMDE Spring campaign]] PT1/2 (duration: 00m 45s) [13:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:08] hashar: Updated… do you want to do it or shall I? 
[13:10:46] !log addshore@tin Synchronized php-1.29.0-wmf.20/extensions/WikimediaEvents/extension.json: [[gerrit:347773|WMDE Spring campaign]] PT2/2 (duration: 00m 45s) [13:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:59] hashar: hoo thats me all done [13:12:34] I'm ready [13:12:48] can either do it myself or follow along [13:15:13] hoo: please do :-} [13:15:17] I am around to assist as needed [13:15:42] zeljkof: yeah all covered :-} with hoo and addshore there is not much to do anyway [13:17:01] (03PS1) 10BBlack: ulsfo recdns: prefer codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/347842 [13:17:03] (03PS1) 10BBlack: acamar/achernar -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/347843 [13:17:09] hashar: :) [13:17:28] (03PS2) 10Alexandros Kosiaris: backup::set: Allow passing an $extras config hash [puppet] - 10https://gerrit.wikimedia.org/r/347837 [13:17:54] hashar: On it :) [13:18:15] (03CR) 10BBlack: [C: 032] ulsfo recdns: prefer codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/347842 (owner: 10BBlack) [13:19:46] (03PS2) 10BBlack: acamar/achernar -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/347843 (https://phabricator.wikimedia.org/T155411) [13:24:00] (03PS1) 10Volans: Traffic: exclude .wikimedia.org hosts (cp1008) [switchdc] - 10https://gerrit.wikimedia.org/r/347844 (https://phabricator.wikimedia.org/T160178) [13:24:56] (03Abandoned) 10BBlack: swift/upload: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345592 (owner: 10BBlack) [13:25:23] (03CR) 10Ema: [C: 031] Traffic: exclude .wikimedia.org hosts (cp1008) [switchdc] - 10https://gerrit.wikimedia.org/r/347844 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:25:51] !log applied CONFIG SET slowlog-log-slower-than 300000 to Redis 6379 on rdb2005 and reset slowlog history to play with the stats [13:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:19] 
ideally I'd like to do the same on rdb1005 [13:27:42] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3175164 (10Gehel) elastic2020 has a good workout with the old disks (same stress + bonnie test). No problem seen. More detailed timing can be s... [13:28:03] (03PS2) 10Volans: Traffic: exclude .wikimedia.org hosts (cp1008) [switchdc] - 10https://gerrit.wikimedia.org/r/347844 (https://phabricator.wikimedia.org/T160178) [13:28:41] elukey: ah slow log, that is interesting [13:28:51] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/6139/pollux.wikimedia.org/ says ok, merging" [puppet] - 10https://gerrit.wikimedia.org/r/347837 (owner: 10Alexandros Kosiaris) [13:29:01] (03PS2) 10Alexandros Kosiaris: Move backup::openldapset into the respective roles [puppet] - 10https://gerrit.wikimedia.org/r/347836 [13:29:43] hashar: atm we have 10000 (10ms) that logs basically everything [13:30:06] rdb2005 is a codfw replica so I don't expect much result [13:31:18] (03CR) 10Alexandros Kosiaris: [C: 032] Move backup::openldapset into the respective roles [puppet] - 10https://gerrit.wikimedia.org/r/347836 (owner: 10Alexandros Kosiaris) [13:31:28] (03PS3) 10Alexandros Kosiaris: backup::set: Allow passing an $extras config hash [puppet] - 10https://gerrit.wikimedia.org/r/347837 [13:31:28] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3175169 (10Marostegui) db1019 can also go away (T147309) and it is on b1. Just mentioning it here to make sure you are aware, just in case if it is easier for you to complet... 
[13:31:34] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] backup::set: Allow passing an $extras config hash [puppet] - 10https://gerrit.wikimedia.org/r/347837 (owner: 10Alexandros Kosiaris) [13:33:08] (03PS3) 10Volans: Traffic: exclude .wikimedia.org hosts (cp1008) [switchdc] - 10https://gerrit.wikimedia.org/r/347844 (https://phabricator.wikimedia.org/T160178) [13:33:17] !log restored slowlog-log-slower-than 10000 on rdb2005 [13:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:35] !log hoo@tin Synchronized php-1.29.0-wmf.19/extensions/Wikidata: Update Wikibase/ ArticlePlaceholder (duration: 02m 16s) [13:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] (03CR) 10Ema: [C: 031] Traffic: exclude .wikimedia.org hosts (cp1008) [switchdc] - 10https://gerrit.wikimedia.org/r/347844 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:37:07] (03PS3) 10BBlack: citoid: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345543 [13:37:15] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3175172 (10jcrespo) [13:37:17] 06Operations, 10ops-eqiad, 10hardware-requests: Decommission db1019 - https://phabricator.wikimedia.org/T147309#3175171 (10jcrespo) [13:37:18] (03CR) 10Volans: [C: 032] Traffic: exclude .wikimedia.org hosts (cp1008) [switchdc] - 10https://gerrit.wikimedia.org/r/347844 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:37:22] !log hoo@tin Synchronized php-1.29.0-wmf.20/extensions/Wikidata: Update Wikibase/ ArticlePlaceholder (duration: 02m 19s) [13:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:22] (03PS3) 10Volans: cache: noop to test the switchdc procedures [puppet] - 10https://gerrit.wikimedia.org/r/347828 (https://phabricator.wikimedia.org/T160178) [13:41:57] !log apply SLOWLOG RESET and CONFIG SET 
slowlog-max-len 100000 (prev value 10000, 10ms) to rdb1005:6380 to track down slow reqs - T125735 [13:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:05] T125735: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735 [13:42:18] !log testing t05_switch_traffic of the switchdc [13:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:22] 06Operations: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3175188 (10ema) [13:43:58] (03PS5) 10Hoo man: Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) [13:44:00] (03PS2) 10Hoo man: Allow running two dumpwikidatattl dumps side by side [puppet] - 10https://gerrit.wikimedia.org/r/347838 (https://phabricator.wikimedia.org/T155103) [13:44:13] we might get some IPSec-related alerts because of cp1072 (T162612) please ignore them [13:44:14] T162612: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612 [13:44:27] RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:44:57] RECOVERY - mediawiki-installation DSH group on mw2246 is OK: OK [13:47:27] (03PS13) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [13:47:27] RECOVERY - mediawiki-installation DSH group on mw2152 is OK: OK [13:47:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:48:06] (03CR) 10Gehel: "rebased on top of the move to 
role / profile" [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel) [13:48:51] !log switchdc (volans@neodymium) START TASK - switchdc.stages.t05_switch_traffic(codfw, eqiad) Switch traffic flow to the appservers in the new datacenter [13:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:10] (03CR) 10Volans: [C: 032] cache: noop to test the switchdc procedures [puppet] - 10https://gerrit.wikimedia.org/r/347828 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:51:31] (03PS14) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) [13:51:40] !log switchdc (volans@neodymium) END TASK - switchdc.stages.t05_switch_traffic(codfw, eqiad) Successfully completed [13:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:10] (03PS6) 10Hoo man: Change dumpwikidatattl to allow producing other flavors [puppet] - 10https://gerrit.wikimedia.org/r/347234 (https://phabricator.wikimedia.org/T155103) [13:52:12] (03PS3) 10Hoo man: Allow running two dumpwikidatattl dumps side by side [puppet] - 10https://gerrit.wikimedia.org/r/347838 (https://phabricator.wikimedia.org/T155103) [13:54:25] (03PS1) 10Alexandros Kosiaris: Disable backups on labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/347847 (https://phabricator.wikimedia.org/T159524) [13:54:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:55:23] (03PS1) 10Ema: Blacklist intel_uncore kernel module [puppet] - 10https://gerrit.wikimedia.org/r/347848 (https://phabricator.wikimedia.org/T162612) [13:56:10] (03PS16) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [13:56:13] 06Operations, 10ops-esams, 10netops: 
cr2-esams FPC 0 is dead - https://phabricator.wikimedia.org/T162239#3175202 (10ayounsi) 05Open>03Resolved Juniper received the faulty part, > Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juni... [13:58:50] hoo: is your deployment complete ? [13:59:24] (03CR) 10Andrew Bogott: [C: 031] Disable backups on labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/347847 (https://phabricator.wikimedia.org/T159524) (owner: 10Alexandros Kosiaris) [13:59:30] hashar: Yes [14:00:33] (03CR) 10Alexandros Kosiaris: [C: 032] "yay https://puppet-compiler.wmflabs.org/6141/" [puppet] - 10https://gerrit.wikimedia.org/r/347847 (https://phabricator.wikimedia.org/T159524) (owner: 10Alexandros Kosiaris) [14:06:15] !log European SWAT complete [14:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:38] (03PS3) 10BBlack: cxserver: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345542 [14:08:35] 06Operations, 06Labs, 13Patch-For-Review: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3175218 (10Andrew) I'm getting clean puppet runs on labs instances with role::puppetmaster::standalone now. So, the labs case for this looks resolved -- anythin... [14:13:16] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#2920977 (10Astinson) +1 to @Nuria 's comment: I think the main concern here from @DarTar and me is that external websites need to have some sort of awareness where the dark tr... 
[14:13:51] (03CR) 10BBlack: [C: 032] cxserver: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345542 (owner: 10BBlack) [14:14:23] (03CR) 10BBlack: [C: 032] citoid: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345543 (owner: 10BBlack) [14:14:29] (03PS4) 10BBlack: citoid: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345543 [14:14:33] (03CR) 10BBlack: [V: 032 C: 032] citoid: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345543 (owner: 10BBlack) [14:15:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:20:47] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:21:47] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 717379 [14:21:58] (03PS1) 10Alexandros Kosiaris: openstack: Decrease by 2 the backups held [puppet] - 10https://gerrit.wikimedia.org/r/347851 (https://phabricator.wikimedia.org/T159524) [14:23:10] !log Restarting Jenkins for git/scm plugins updates [14:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:51] 06Operations, 06Labs, 13Patch-For-Review: Standalone puppet masters are broken (uninstallable packages) - https://phabricator.wikimedia.org/T162462#3163952 (10chasemp) > I'm getting clean puppet runs on labs instances with role::puppetmaster::standalone now. So, the labs case for this looks resolved -- an... 
[14:26:37] (03CR) 10BBlack: [C: 031] Blacklist intel_uncore kernel module [puppet] - 10https://gerrit.wikimedia.org/r/347848 (https://phabricator.wikimedia.org/T162612) (owner: 10Ema) [14:28:14] (03PS2) 10Ema: Blacklist intel_uncore kernel module [puppet] - 10https://gerrit.wikimedia.org/r/347848 (https://phabricator.wikimedia.org/T162612) [14:28:24] (03CR) 10Ema: [V: 032 C: 032] Blacklist intel_uncore kernel module [puppet] - 10https://gerrit.wikimedia.org/r/347848 (https://phabricator.wikimedia.org/T162612) (owner: 10Ema) [14:30:39] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3175256 (10elukey) Didn't find much from the SLOWLOG, I expected more commands to be l... [14:31:51] !log running maintain-meta_p on labsdb1001/1003/1009/1010/1011 [14:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:37] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.30 seconds [14:42:42] Hi, im wondering how does icinga get connected to wikimedia hosts? Do you have to run a command that accepts a certificate? [14:42:50] Im comparing icinga2. [14:45:37] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 323.23 seconds [14:46:08] why 2 alerts? [14:47:13] 06Operations, 10DBA, 10Icinga, 10Monitoring, 13Patch-For-Review: "db1047/eventlogging_sync processes" icinga alert is flaky since at least early January - https://phabricator.wikimedia.org/T123509#3175280 (10jcrespo) 05Open>03Resolved a:03jcrespo Not happening for a long time now. Also T124307. [14:47:47] jynus: the text changed [14:48:08] although I though was only on passive ones... 
[14:48:17] this is not passive [14:48:33] could be T159266 again [14:48:33] T159266: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266 [14:49:14] nope, I am checking [14:49:20] jynus: volans: the service went back to WARNING, then to CRITICAL again which triggered a second notification [14:49:29] The BBU and disks look fine [14:49:31] based on https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=db1047&service=MariaDB+Slave+Lag%3A+s2 [14:49:37] thanks hashar for checking the alert history :D [14:50:06] it is stable at 200-300 second behind [14:50:09] that is weird [14:51:41] there is a spike in inserts [14:51:46] probablly eventlogging [14:52:31] there is contention on the binlog, I am going to disable sync there [14:53:22] and lag recovered [14:53:28] interesting [14:53:36] i think db1047 can no longer handle eventlogging high insert rate [14:53:37] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.20 seconds [14:54:10] the bandwidth is not that large [14:54:13] in bytes [14:54:25] but the number of fsyncs required is [14:54:52] When I created https://grafana.wikimedia.org/dashboard/db/mysql?panelId=19&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1047&from=now-12h&to=now [14:55:03] I always said it was going to be a life-saver [14:55:10] normally the issues are more obvious [14:55:27] but when they aren't, it is a great chart to have [14:55:47] alternatively, we could expand the binlog cache [14:58:33] (03PS1) 10BBlack: traffic: a/p services switch to temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/347852 [14:58:35] (03PS1) 10BBlack: traffic: a/a services switch to codfw-only [puppet] - 10https://gerrit.wikimedia.org/r/347853 [14:58:37] (03CR) 10Volans: [C: 032] "puppet compiler seems happy: https://puppet-compiler.wmflabs.org/6142/" [puppet] - 10https://gerrit.wikimedia.org/r/347816 (https://phabricator.wikimedia.org/T160178) (owner: 
10Volans) [14:58:44] (03PS3) 10Volans: Switchdc: rename redis stage from t05 to t06 [puppet] - 10https://gerrit.wikimedia.org/r/347816 (https://phabricator.wikimedia.org/T160178) [14:59:19] jynus: that graph is indeed super helpful [15:06:58] (03PS3) 10Andrew Bogott: nova-compute monitoring: Check for one and only one nova-compute process [puppet] - 10https://gerrit.wikimedia.org/r/347688 (https://phabricator.wikimedia.org/T162640) [15:06:59] (03PS1) 10Andrew Bogott: maintain-meta_p: Add a missing quote-mark [puppet] - 10https://gerrit.wikimedia.org/r/347854 [15:08:46] (03PS1) 10Gehel: service::node - add a defined() guard on git deployment [puppet] - 10https://gerrit.wikimedia.org/r/347855 [15:11:59] (03PS2) 10Reedy: maintain-meta_p: Add a missing quote-mark [puppet] - 10https://gerrit.wikimedia.org/r/347854 (owner: 10Andrew Bogott) [15:14:21] (03CR) 10Andrew Bogott: [C: 032] maintain-meta_p: Add a missing quote-mark [puppet] - 10https://gerrit.wikimedia.org/r/347854 (owner: 10Andrew Bogott) [15:16:19] (03PS1) 10Ottomata: Remove unused ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/347856 [15:17:14] (03PS2) 10Gehel: service::node - add a defined() guard on git deployment [puppet] - 10https://gerrit.wikimedia.org/r/347855 [15:17:46] (03PS2) 10Ottomata: Remove unused ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/347856 [15:19:09] (03PS1) 10Elukey: check_hadoop_yarn_node_state: add syslog logging for CRITICAL states [puppet] - 10https://gerrit.wikimedia.org/r/347857 [15:22:03] (03CR) 10Ottomata: [C: 032] Remove unused ganglia views [puppet] - 10https://gerrit.wikimedia.org/r/347856 (owner: 10Ottomata) [15:24:59] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3175375 (10hashar) When running a LUA script does it blocks new connections as well? 
[15:26:57] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:55] * elukey stares at mw1168 with anger against deadlocks [15:28:57] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [15:29:01] mmmm this might not be what I thought, only videoscaler overloaded [15:29:07] elukey@mw1168:~$ hhvmadm check-health [15:29:07] { "load":20 [15:29:07] , "queued":1 [15:29:17] elukey@mw1168:~$ hhvmadm check-health [15:29:17] { "load":19 [15:29:18] , "queued":0 [15:29:32] not good since this should not happen with the current settings [15:30:07] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:27] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:49] * elukey sigh [15:31:07] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:20] https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=All [15:31:23] wow [15:31:27] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:03] this is hhvm that starts queueing and fails health checks [15:32:17] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [15:32:21] I'm experiencing some indef load [15:32:23] wasn't there a recent change to videoscalers somewhere upstream in the backscroll here? 
something about trusty [15:32:55] bblack: I reimaged some codfw videoscalers to trusty today [15:33:41] this mess already happened, the current settings for videoscalers *should* prevent threads exaustion but we probably need to revise them [15:33:57] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [15:34:57] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [15:37:17] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.004 second response time [15:44:46] for example, now for mw116[89] we have 14 transcode jobs that can be in flight at any given time, and 20 hhvm threads [15:45:04] nproc 32 [15:46:04] 12[59,60] 10 transcode jobs and 15 hhvm threads, nproc 16 [15:46:06] <_joe_> elukey: something has gone wrong, as in for some reason jobs are orphaned [15:46:28] <_joe_> elukey: in theory, 10 transcode jobs == 10 hhvm threads [15:46:32] yeah [15:46:39] <_joe_> because 1 job : 1 request to hhvm [15:47:57] (03CR) 10Dzahn: [C: 031] nova-compute monitoring: Check for one and only one nova-compute process [puppet] - 10https://gerrit.wikimedia.org/r/347688 (https://phabricator.wikimedia.org/T162640) (owner: 10Andrew Bogott) [15:48:07] PROBLEM - nova-network process on labtestnet2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network [15:48:16] ^ me silencing [15:50:02] (03PS1) 10BBlack: traffic: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/347859 [15:50:04] (03PS1) 10BBlack: traffic: swift a/p in codfw only [puppet] - 10https://gerrit.wikimedia.org/r/347860 [15:53:49] !log remove 2fa for Freddy2001 on wikitech per T162772 [15:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:56] T162772: Disable 2FA for Freddy2001 on Wikitech - https://phabricator.wikimedia.org/T162772 [15:56:58] 06Operations, 06Performance-Team, 
10Wikidata, 10Wikimedia-Site-requests: Increase $wgExpensiveParserFunctionLimit on nowiki - https://phabricator.wikimedia.org/T160685#3175418 (10Krinkle) [16:12:43] jouncebot: now [16:12:43] No deployments scheduled for the next 1 hour(s) and 47 minute(s) [16:12:47] jouncebot: next [16:12:47] In 1 hour(s) and 47 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T1800) [16:13:01] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3175451 (10elukey) >>! In T125735#3175375, @hashar wrote: > When running a LUA script... [16:14:04] K I'm trying to get donatewiki back on the train... [16:14:48] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 721744 [16:16:07] RECOVERY - nova-network process on labtestnet2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-network [16:18:41] (03CR) 10Krinkle: [C: 04-1] "Dereckson: Incorrect - these are in fact where the current files came from. 
Some are from the standard PNG rendering on Commons (almost al" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346234 (https://phabricator.wikimedia.org/T98640) (owner: 10Krinkle) [16:21:47] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [16:23:06] (03PS4) 10Andrew Bogott: nova-compute monitoring: Check for one and only one nova-compute process [puppet] - 10https://gerrit.wikimedia.org/r/347688 (https://phabricator.wikimedia.org/T162640) [16:23:58] (03PS1) 10Dzahn: site/install: unify install nodes into a single regex [puppet] - 10https://gerrit.wikimedia.org/r/347862 [16:25:37] (03CR) 10Andrew Bogott: [C: 032] nova-compute monitoring: Check for one and only one nova-compute process [puppet] - 10https://gerrit.wikimedia.org/r/347688 (https://phabricator.wikimedia.org/T162640) (owner: 10Andrew Bogott) [16:28:12] (03PS2) 10Andrew Bogott: nova.conf: Reduce fixed_ip_disassociate_timeout to three minutes. [puppet] - 10https://gerrit.wikimedia.org/r/347053 (https://phabricator.wikimedia.org/T160908) [16:29:42] (03CR) 10Andrew Bogott: [C: 032] nova.conf: Reduce fixed_ip_disassociate_timeout to three minutes. [puppet] - 10https://gerrit.wikimedia.org/r/347053 (https://phabricator.wikimedia.org/T160908) (owner: 10Andrew Bogott) [16:29:57] (03PS2) 10Andrew Bogott: nova.conf: change dhcp lease times to 12 hours. [puppet] - 10https://gerrit.wikimedia.org/r/347054 (https://phabricator.wikimedia.org/T160908) [16:30:48] (03PS1) 10Dzahn: site: remove all the "$cluster = 'misc'" [puppet] - 10https://gerrit.wikimedia.org/r/347867 [16:32:05] (03CR) 10Andrew Bogott: [C: 031] nova.conf: change dhcp lease times to 12 hours. [puppet] - 10https://gerrit.wikimedia.org/r/347054 (https://phabricator.wikimedia.org/T160908) (owner: 10Andrew Bogott) [16:32:19] (03CR) 10Andrew Bogott: [C: 032] nova.conf: change dhcp lease times to 12 hours. 
[puppet] - 10https://gerrit.wikimedia.org/r/347054 (https://phabricator.wikimedia.org/T160908) (owner: 10Andrew Bogott) [16:32:57] RainbowSprinkles: fyi it looks like there were undeployed changes to Flow. I'm going to scap-dir to only get FundraiserLandingPage. [16:33:17] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:33] awight: ok [16:34:40] (03PS1) 10Giuseppe Lavagetto: Change the RO message we match [switchdc] - 10https://gerrit.wikimedia.org/r/347868 [16:34:42] (03PS1) 10Giuseppe Lavagetto: Move varnish puppet disabling in t00 [switchdc] - 10https://gerrit.wikimedia.org/r/347869 [16:34:44] (03PS1) 10Giuseppe Lavagetto: Warmup: remove api warmup, useless, and add a log line. [switchdc] - 10https://gerrit.wikimedia.org/r/347870 [16:35:20] !log awight@tin Synchronized php-1.29.0-wmf.19/extensions/FundraiserLandingPage: Fix for donatewiki T162716 (duration: 00m 48s) [16:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:27] T162716: Fix raw HTML tags breaking 1.29.0-wmf19+ - https://phabricator.wikimedia.org/T162716 [16:35:28] (03PS7) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [16:35:35] (03PS8) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [16:36:28] (03PS1) 10Dzahn: site/thumbor: remove duplicate base::firewall include [puppet] - 10https://gerrit.wikimedia.org/r/347873 [16:36:29] (03CR) 10jerkins-bot: [V: 04-1] Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 (owner: 10Paladox) [16:36:53] (03PS9) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [16:37:31] !log awight@tin Synchronized php-1.29.0-wmf.20/extensions/FundraiserLandingPage: Fix for donatewiki T162716 (duration: 00m 45s) [16:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:53] (03PS1) 
10Awight: Put donatewiki back on the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347875 [16:41:32] (03PS1) 10Dzahn: site/ganglia: move firewall/standard/IPv6 to role [puppet] - 10https://gerrit.wikimedia.org/r/347876 [16:41:49] (03PS1) 10Hashar: jenkins: use configuration file for logging [puppet] - 10https://gerrit.wikimedia.org/r/347877 [16:42:00] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3175540 (10Rameshti) 05stalled>03Open >>! In T161529#3170938, @Rameshti wrote: > Project name: विकिपिडिया > Project namespace: विकिपिडिया > Project talk namespace: विकिपिडिया_क... [16:42:02] (03CR) 10Awight: [C: 032] Put donatewiki back on the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347875 (owner: 10Awight) [16:42:36] (03CR) 10Hashar: [C: 04-1] "Chad, in case java logging properties make sense to you :-} I have no idea what are $1 / $2 etc." [puppet] - 10https://gerrit.wikimedia.org/r/347877 (owner: 10Hashar) [16:43:37] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3175543 (10Rameshti) I agree [16:44:39] (03Merged) 10jenkins-bot: Put donatewiki back on the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347875 (owner: 10Awight) [16:44:52] (03CR) 10jenkins-bot: Put donatewiki back on the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347875 (owner: 10Awight) [16:46:32] (03CR) 10Volans: [C: 04-1] "AFAIK we still want to bump the wmf-config one to 15 minutes, or maybe a lesser number but not 3 minutes for the switchover period." 
[switchdc] - 10https://gerrit.wikimedia.org/r/347868 (owner: 10Giuseppe Lavagetto) [16:46:36] !log awight@tin rebuilt wikiversions.php and synchronized wikiversions files: (no justification provided) [16:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:27] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:51:23] (03CR) 10Andrew Bogott: [C: 032] openstack: Decrease by 2 the backups held [puppet] - 10https://gerrit.wikimedia.org/r/347851 (https://phabricator.wikimedia.org/T159524) (owner: 10Alexandros Kosiaris) [16:51:42] (03PS2) 10Andrew Bogott: openstack: Decrease by 2 the backups held [puppet] - 10https://gerrit.wikimedia.org/r/347851 (https://phabricator.wikimedia.org/T159524) (owner: 10Alexandros Kosiaris) [16:51:56] who can/should i talk to about setting up alerts on grafana? [16:52:07] it doesn't seem to be able to send emails :/ [16:52:24] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3114255 (10hashar) Since you mention DHCP leases. For the Nodepool images I have hunt slowness in the boot process and I got rid of a few ones related to... [16:52:47] phuedx: godog most probably [16:53:10] phuedx: the grafana alerting is good for play testing, but once happy with the query that should be made an Icinga check [16:53:22] using check_graphite or something like that [16:53:39] * phuedx fears he's about to go down a rabbit hole [16:54:11] phuedx: yeah what hashar said, or mimick what the performance team did with check_grafana_alert [16:54:57] at the bottom of the rabbit hole there might be some data above the critical threshold [16:55:36] I am off *wave* [16:56:19] godog: check_grafana_alert? 
[16:57:07] phuedx: yeah, basically checks the dashboard url itself [16:57:20] (03CR) 10Krinkle: [C: 031] Warmup: remove api warmup, useless, and add a log line. [switchdc] - 10https://gerrit.wikimedia.org/r/347870 (owner: 10Giuseppe Lavagetto) [16:59:15] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#3175600 (10Dzahn) p:05Low>03Normal just noticed this in Icinga again. still there. duration now 349 days. [17:00:08] godog: hrrm. where are the icinga alerts defined? [17:00:58] 06Operations, 10Icinga, 06Labs, 10Labs-Infrastructure: remove/fix "Check for gridmaster host resolution" Icinga check for "labtest" - https://phabricator.wikimedia.org/T152024#3175605 (10Dzahn) a:03Dzahn [17:01:17] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:01:20] phuedx: all in the puppet repo, you likely want modules/monitoring/manifests/grafana_alert.pp [17:02:51] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3175608 (10Dzahn) That said, it's still better to change it now rather than a more complicated "rename it"-ticket later. [17:03:07] woo [17:03:20] so i can learn just enough to be dangerous ;D [17:03:21] (03CR) 10Chad: "find: The -delete action atomatically turns on -depth, but -prune does nothing when -depth is in effect. If you want to carry on anyway, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347633 (owner: 10Chad) [17:04:10] thanks godog! 
[17:05:53] 06Operations, 10TimedMediaHandler-Transcode: Videoscalers overloaded once in a while triggering alarms - https://phabricator.wikimedia.org/T162815#3175646 (10elukey) [17:06:48] 06Operations, 10TimedMediaHandler-Transcode, 15User-Elukey: Videoscalers overloaded once in a while triggering alarms - https://phabricator.wikimedia.org/T162815#3175663 (10elukey) [17:07:02] opened --^ for the videoscalers issue that we encountered today [17:07:23] * elukey off! [17:09:41] 06Operations, 10DNS, 06Services, 10Traffic: icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3175699 (10BBlack) [17:09:43] 06Operations, 07HHVM: Frequent TCP RST on connections between HHVM and Redis - https://phabricator.wikimedia.org/T162354#3175712 (10elukey) https://github.com/facebook/hhvm/commit/671c259942fc2212ac346cd9830a31ec5d545c1b is the related commit in the master branch of the hhvm upstream repo. @MoritzMuehlenhoff... [17:12:50] (03PS1) 10Dzahn: site/redis_slave: remove duplicate base::firewall include [puppet] - 10https://gerrit.wikimedia.org/r/347881 [17:19:44] (03PS1) 10Dzahn: site/prometheus: rm duplicate base::firewall, mv standard to role [puppet] - 10https://gerrit.wikimedia.org/r/347883 [17:23:04] (03PS1) 10Andrew Bogott: Repool labvirt1002. [puppet] - 10https://gerrit.wikimedia.org/r/347884 (https://phabricator.wikimedia.org/T162640) [17:25:09] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3175790 (10Cmjohnson) Great! I will decom those 2 servers and utilize their space. [17:25:15] (03CR) 10Andrew Bogott: [C: 032] Repool labvirt1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/347884 (https://phabricator.wikimedia.org/T162640) (owner: 10Andrew Bogott) [17:25:47] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team: remove/fix jenkins icinga monitoring on contin2001 - https://phabricator.wikimedia.org/T162822#3175792 (10Dzahn) [17:25:55] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team: remove/fix jenkins icinga monitoring on contin2001 - https://phabricator.wikimedia.org/T162822#3175808 (10Dzahn) a:05RobH>03None [17:27:14] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team: remove/fix jenkins icinga monitoring on contin2001 - https://phabricator.wikimedia.org/T162822#3175792 (10Dzahn) ah, actually this is T150771 and just like what we did for zuul (https://gerrit.wikimedia.org/r/327695) [17:27:34] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team: remove/fix jenkins icinga monitoring on contin2001 - https://phabricator.wikimedia.org/T162822#3175818 (10Dzahn) a:03Dzahn [17:27:47] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:27:54] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3175792 (10Dzahn) [17:28:04] (03PS1) 10Andrew Bogott: Repool labvirt1001. 
[puppet] - 10https://gerrit.wikimedia.org/r/347887 (https://phabricator.wikimedia.org/T159835) [17:28:14] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: Secondary production Jenkins for CI - https://phabricator.wikimedia.org/T150771#2795939 (10Dzahn) [17:28:17] 06Operations, 10Continuous-Integration-Infrastructure, 10Icinga, 06Release-Engineering-Team: remove/fix jenkins icinga monitoring on contint2001 - https://phabricator.wikimedia.org/T162822#3175792 (10Dzahn) [17:28:32] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3175826 (10Ramesh_Bohara) 05Open>03Resolved a:03Ramesh_Bohara In T161529#3170855, @Janak_bhatta wrote: Project name: विकिपिडिया Project namespace: विकिपिडिया Project talk nam... [17:28:36] (03PS2) 10Andrew Bogott: Repool labvirt1001 [puppet] - 10https://gerrit.wikimedia.org/r/347887 (https://phabricator.wikimedia.org/T159835) [17:29:39] uh, what's happened at https://phabricator.wikimedia.org/T161529 ? [17:29:55] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3175832 (10Urbanecm) 05Resolved>03stalled a:05Ramesh_Bohara>03None Wasn't done so reopening&declaiming @Ramesh_Bohara. I think it is stalled still so marking as stalled til... [17:29:57] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3175836 (10DatGuy) This isn't resolved. [17:30:07] cheers Urbanecm [17:30:37] DatGuy: Hi :) [17:30:51] I'll finish it probably tomorrow [17:30:57] DatGuy: you mean config? [17:31:02] +logo [17:31:05] (03PS17) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [17:31:08] Great! 
[17:31:26] Then somebody with corresponding permissions would actually create the wiki :) [17:32:16] the scap stuff :P [17:33:19] And database, yeah ;) [17:33:34] (03PS1) 10Jdlrobson: Tweak Russian logo wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347888 (https://phabricator.wikimedia.org/T162036) [17:33:58] and running create_wiki [17:34:27] and mwscript [17:34:27] mutante: :) [17:34:33] :p [17:34:39] DatGuy: that is what mutante meant I think :D [17:35:04] ah yead, addWiki.php [17:35:44] eh, yea, sorry [17:37:13] (03PS2) 10Jdlrobson: Tweak Russian logo wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347888 (https://phabricator.wikimedia.org/T162036) [17:40:41] (03PS1) 10Cmjohnson: removing mgmt dns entries for wiped and removed db1019 and db1042 [dns] - 10https://gerrit.wikimedia.org/r/347889 [17:41:50] (03CR) 10Cmjohnson: [C: 032] removing mgmt dns entries for wiped and removed db1019 and db1042 [dns] - 10https://gerrit.wikimedia.org/r/347889 (owner: 10Cmjohnson) [17:42:27] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Set up LVS for labs dns recursors - https://phabricator.wikimedia.org/T119660#3175864 (10Andrew) a:05Andrew>03None [17:43:36] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3175872 (10Cmjohnson) [17:43:38] 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#3175870 (10Cmjohnson) 05Open>03Resolved Removed from dns, removed from rack updated racktables and resolving [17:44:10] 06Operations, 10ops-eqiad, 10hardware-requests: Decommission db1019 - https://phabricator.wikimedia.org/T147309#3175875 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson wiped, removed from rack, racktables updated [17:44:12] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266762 
(10Cmjohnson) [17:45:47] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:47:45] 06Operations, 07Availability: Set databases as read-only or switchover to secondary datacenter - https://phabricator.wikimedia.org/T138810#3175886 (10jcrespo) [17:54:29] (03PS1) 10Dzahn: planet/varnish-misc: switch planet to active-active [puppet] - 10https://gerrit.wikimedia.org/r/347892 [17:58:11] (03PS2) 10Dzahn: planet/varnish-misc: switch planet to active-active [puppet] - 10https://gerrit.wikimedia.org/r/347892 [17:58:49] (03PS3) 10Dzahn: planet/varnish-misc: switch planet to active-active [puppet] - 10https://gerrit.wikimedia.org/r/347892 [17:59:16] 06Operations, 13Patch-For-Review: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#3175931 (10Dzahn) a:03Dzahn [17:59:26] 06Operations, 13Patch-For-Review: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1019786 (10Dzahn) p:05Low>03Normal [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T1800). Please do the needful. [18:00:04] Jdlrobson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:09] present [18:00:55] (03PS1) 10Matthias Mullie: Run 3d2png with xfvb-run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347893 (https://phabricator.wikimedia.org/T159717) [18:01:20] 06Operations, 13Patch-For-Review: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#3175942 (10Dzahn) >>! 
In T88761#2304022, @akosiaris wrote: > Now the planet-noobie question: why only have the feed update active in the current... [18:02:03] who's doing swat today? [18:02:51] I can SWAT [18:04:18] (03PS1) 10Chad: group1 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347894 [18:04:33] (03CR) 10Chad: [C: 04-2] "NOT TIL I SAY SO" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347894 (owner: 10Chad) [18:04:52] thcipriani: w00t [18:05:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347888 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [18:05:33] RainbowSprinkles: and now somebody removes your CR :P [18:08:23] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3175952 (10Cmjohnson) @marostegui Final Placement hostname rack db1096 a6 db1097 d1 db1098 b5 db1099 b2 db1100 c2 db1101 c2 (you can use db1057 slot, as db1057 can go away:... [18:09:34] (03Merged) 10jenkins-bot: Tweak Russian logo wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347888 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [18:09:44] (03CR) 10jenkins-bot: Tweak Russian logo wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347888 (https://phabricator.wikimedia.org/T162036) (owner: 10Jdlrobson) [18:10:00] Sagan: If they do, I'll find them and beat them senseless with a little stick [18:10:31] A little stick, so it will take awhile [18:10:56] RainbowSprinkles: cool to revert wikiversions.json on tin for SWAT? [18:11:03] Oh, whoops [18:11:03] Yes [18:11:07] * thcipriani does [18:11:54] jdlrobson: russion logo wordmark is updated on mwdebug1002, check please [18:12:01] *Russian [18:13:48] thcipriani: testing [18:14:14] thcipriani: good! 
[18:14:18] thcipriani: swat away [18:14:20] jdlrobson: ok syncing [18:16:57] !log thcipriani@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-ru.svg: SWAT: [[gerrit:347888|Tweak Russian logo wordmark]] T162036 PART I (duration: 00m 43s) [18:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:06] T162036: Rendering issues with logo in Russian Wikipedia on mobile - https://phabricator.wikimedia.org/T162036 [18:18:04] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:347888|Tweak Russian logo wordmark]] T162036 PART II (duration: 00m 43s) [18:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:16] ^ jdlrobson live [18:18:27] thcipriani: thanks :) which one next? [18:18:35] well let's see [18:20:36] jdlrobson: setMobileOptions at time of skin creation is live for wmf.20 on mwdebug1002, check please [18:20:41] on it! [18:24:33] thcipriani: sync away :) all good [18:24:47] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [18:25:05] * thcipriani does [18:26:56] !log thcipriani@tin Synchronized php-1.29.0-wmf.20/extensions/MobileFrontend: SWAT: [[gerrit:347885|setMobileOptions at time of skin creation]] T125588 (duration: 00m 46s) [18:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:04] T125588: SkinMinerva and SkinMinervaBeta should not know about MobileContext - https://phabricator.wikimedia.org/T125588 [18:27:11] ^ jdlrobson live now [18:27:21] sweet [18:27:40] the last one i cant test btw. It just changes a log level for an error [18:27:48] gotcha [18:27:49] so only test is we dont blow anything up :) [18:28:54] hrm tried to cherry pick that one to wmf.20 via gerrit and it gave me an error [18:29:02] could you make a cherry-pick manually? 
[18:29:03] the 2nd one depends on the 1st one [18:29:23] ah, that'll do it :) [18:30:55] !log ppchelko@tin Started deploy [changeprop/deploy@0a9a008]: Config: Send ORES precache requests to both DCs. T159615 [18:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:02] T159615: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615 [18:33:12] jdlrobson: log channel change is live on mwdebug1002, check please [18:33:37] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:34:07] PROBLEM - changeprop endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:35:45] we thcipriani sweet [18:36:02] sync away [18:36:10] is...is that good? ah, ok, going. [18:36:49] ChangeProp on SCB alert is under control is un [18:37:48] !log ppchelko@tin Finished deploy [changeprop/deploy@0a9a008]: Config: Send ORES precache requests to both DCs. T159615 (duration: 06m 53s) [18:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] T159615: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615 [18:38:24] !log thcipriani@tin Synchronized php-1.29.0-wmf.20/extensions/MobileFrontend: SWAT: [[gerrit:347886|formatter: Change log channel of infobox message]] T149884 (duration: 00m 46s) [18:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:30] T149884: Log instances of infoboxes being wrapped in containers - https://phabricator.wikimedia.org/T149884 [18:38:31] ^ jdlrobson live now [18:38:38] thcipriani: both or just the one? 
[18:38:41] 06Operations, 06Commons, 10Datasets-General-or-Unknown, 07Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229#3176038 (10Tshrinivasan) Like to know what is the current backup policy for commons, total storage size, and the architecture of the servers/storage... [18:38:56] jdlrobson: just the one, waiting on jenkins to finish up with the other [18:39:00] ok great :) [18:39:03] just checking [18:42:32] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3176045 (10Marostegui) [18:42:45] jdlrobson: ok, just merged, going to push that change out now [18:43:00] !log ppchelko@tin Started deploy [changeprop/deploy@e403f56]: Config: Send ORES precache requests to both DCs. Attempt #2. T159615 [18:43:07] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156297 (10Marostegui) >>! In T162233#3175952, @Cmjohnson wrote: > @marostegui Final Placement > > hostname rack > db1096 a6 > db1097 d1 > db1098 b5 > db1099 b2 > db1100 c2... [18:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:08] T159615: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615 [18:43:25] (03PS10) 10Paladox: Create an icinga2 class [puppet] - 10https://gerrit.wikimedia.org/r/347640 [18:43:37] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [18:44:07] RECOVERY - changeprop endpoints health on scb2001 is OK: All endpoints are healthy [18:44:16] !log ppchelko@tin Finished deploy [changeprop/deploy@e403f56]: Config: Send ORES precache requests to both DCs. Attempt #2. 
T159615 (duration: 01m 15s) [18:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:37] !log thcipriani@tin Synchronized php-1.29.0-wmf.20/extensions/MobileFrontend: SWAT: [[gerrit:347802|formatter: Increase log level of infobox message]] T149884 (duration: 00m 46s) [18:45:43] ^ jdlrobson all live now [18:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:44] T149884: Log instances of infoboxes being wrapped in containers - https://phabricator.wikimedia.org/T149884 [18:45:51] W00t [18:45:53] thanks thcipriani [18:45:59] * jdlrobson does little dance [18:46:03] np :) [18:51:57] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [18:53:07] (03PS1) 10Thcipriani: Scap: set deployment_server correctly [puppet] - 10https://gerrit.wikimedia.org/r/347898 (https://phabricator.wikimedia.org/T162814) [18:54:57] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T1900). 
[19:02:03] (03CR) 10Chad: [C: 032] group1 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347894 (owner: 10Chad) [19:03:13] (03Merged) 10jenkins-bot: group1 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347894 (owner: 10Chad) [19:03:26] (03CR) 10jenkins-bot: group1 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347894 (owner: 10Chad) [19:05:08] !log demon@tin Synchronized php: symlink bump (duration: 00m 42s) [19:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:37] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.20 [19:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:24] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (done), 15User-mobrovac: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3176179 (10Pchelolo) 05Open>03Resolved a:03Pchelolo The prefacing rule is now updating... [19:08:26] 06Operations, 10Revision-Scoring-As-A-Service-Backlog, 13Patch-For-Review: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#3176183 (10Pchelolo) [19:08:33] (03Draft1) 10Paladox: icinga: Remove icinga-init.sh file, its now provided by the package [puppet] - 10https://gerrit.wikimedia.org/r/347899 [19:08:35] (03PS2) 10Paladox: icinga: Remove icinga-init.sh file, its now provided by the package [puppet] - 10https://gerrit.wikimedia.org/r/347899 [19:10:23] 06Operations, 06Commons, 10Datasets-General-or-Unknown, 07Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229#3176187 (10Dzahn) here are some links about the storage architecture that holds images (swift) https://wikitech.wikimedia.org/wiki/Swift https://wi... 
[19:12:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:13:08] (03PS1) 10BryanDavis: wikitech: Enable binary memcached protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) [19:13:14] (03PS4) 10Dzahn: planet/varnish-misc: switch planet to active-active [puppet] - 10https://gerrit.wikimedia.org/r/347892 [19:14:47] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.00 seconds [19:15:02] (03CR) 10Chad: "mc-labs isn't wikitech, it's beta...." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) (owner: 10BryanDavis) [19:15:29] RainbowSprinkles: heh. well that's not helpful then is it :) [19:16:38] Not really :) [19:16:47] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:17:03] I guess it needs to go in mc.php with a HHVM_VERSION guard [19:17:07] * bd808 amends [19:17:47] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:18:57] (03PS2) 10BryanDavis: wikitech: Enable binary memcached protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) [19:20:01] (03CR) 10Chad: "Minor nitpick, but lgtm." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) (owner: 10BryanDavis) [19:20:49] (03PS3) 10BryanDavis: wikitech: Enable binary memcached protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) [19:21:05] (03CR) 10BryanDavis: wikitech: Enable binary memcached protocol (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) (owner: 10BryanDavis) [19:22:40] RainbowSprinkles: should I put that in the next SWAT window? [19:25:19] Or just now [19:26:37] now would be swell [19:27:30] are you mid-train? [19:29:35] Nope [19:29:37] Train is done [19:29:39] Lemme do it [19:29:53] (03CR) 10Chad: [C: 032] wikitech: Enable binary memcached protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) (owner: 10BryanDavis) [19:30:24] my hero [19:30:53] (03Merged) 10jenkins-bot: wikitech: Enable binary memcached protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) (owner: 10BryanDavis) [19:31:02] (03CR) 10jenkins-bot: wikitech: Enable binary memcached protocol [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347900 (https://phabricator.wikimedia.org/T158613) (owner: 10BryanDavis) [19:31:14] https://www.youtube.com/watch?v=koJlIGDImiU [19:31:18] bd808: ^ [19:33:22] !log demon@tin Synchronized wmf-config/mc.php: Bryan made me do it (duration: 00m 43s) [19:33:22] I'll put you down for that at the next karaoke night [19:33:23] demon@tin: Failed to log message to wiki. Somebody should check the error logs. [19:33:46] yikes. wazzup stashbot? [19:34:43] mwoauth-invalid-authorization nonce already used :/ [19:36:55] !log testing stashbot post-restart [19:36:56] bd808: Failed to log message to wiki. Somebody should check the error logs. 
[19:37:22] this smells like something related to the memcached change [19:37:46] OAuthAuthorizationError: The authorization headers in your request are not valid: Nonce already used [19:38:03] (03CR) 10Ejegg: [C: 04-1] "Looks like this needs a change to wmf-config/InitialiseSettings.php so foundationwiki points to wmf.png & HD versions, not foundationwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [19:39:11] 06Operations, 10MediaWiki-Cache, 06Performance-Team, 10Traffic: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#3176241 (10aaron) 05Open>03declined The rebound purge is deliberate and hard to de-duplicate in any case (unless two purges came in at the same time a... [19:39:25] RainbowSprinkles: looks like that change did bad things to wikitech :/ [19:39:44] (03PS2) 10Thcipriani: Scap: set deployment_server correctly [puppet] - 10https://gerrit.wikimedia.org/r/347898 (https://phabricator.wikimedia.org/T162814) [19:39:49] (03PS1) 10BryanDavis: Revert "wikitech: Enable binary memcached protocol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347905 [19:40:11] (03CR) 10BryanDavis: [C: 032] Revert "wikitech: Enable binary memcached protocol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347905 (owner: 10BryanDavis) [19:40:30] bd808: Wherps [19:40:45] meh. 
rollback is in the pipeline [19:41:23] (03Merged) 10jenkins-bot: Revert "wikitech: Enable binary memcached protocol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347905 (owner: 10BryanDavis) [19:41:27] (03PS1) 10Awight: Link to the wikitech page [dumps] - 10https://gerrit.wikimedia.org/r/347906 [19:41:28] (03PS1) 10Awight: Update instructions for fetching mwbzutils source [dumps] - 10https://gerrit.wikimedia.org/r/347907 [19:41:31] (03PS1) 10Awight: Document quote gotcha; include new binary path [dumps] - 10https://gerrit.wikimedia.org/r/347908 [19:41:33] (03CR) 10jenkins-bot: Revert "wikitech: Enable binary memcached protocol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347905 (owner: 10BryanDavis) [19:42:32] You already reverting on tin? [19:42:37] yeah [19:42:47] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] [19:42:57] !log bd808@tin Synchronized wmf-config/mc.php: Revert "wikitech: Enable binary memcached protocol" (duration: 00m 43s) [19:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:14] that looks better laready [19:44:46] I can log into wikitech again and stashbot can too [19:45:06] I'll go looking for error logs explaining what went wrong [19:45:25] (03CR) 10Ejegg: [C: 04-1] "Should this patch also update static/images/wikimedia-button.png , which is in the footer of a lot of projects?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/307475 (https://phabricator.wikimedia.org/T144254) (owner: 10Urbanecm) [19:47:35] RainbowSprinkles: I read bugs badly and missed "We can't just use BINARY because we use twemproxy, which is ASCII only." [19:50:47] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] [19:51:08] is that ^ from wikitech or something else? 
[19:51:34] (03PS2) 10Volans: Change the RO message we match [switchdc] - 10https://gerrit.wikimedia.org/r/347868 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [19:52:51] looks like the memcached error spike that icinga is pointing out was the wikitech config change. The rate looks to be back to normal on logstash [19:52:57] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [19:58:20] (03CR) 10Dzahn: [C: 032] planet/varnish-misc: switch planet to active-active [puppet] - 10https://gerrit.wikimedia.org/r/347892 (owner: 10Dzahn) [19:58:54] (03PS5) 10Dzahn: planet/varnish-misc: switch planet to active-active [puppet] - 10https://gerrit.wikimedia.org/r/347892 [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T2000). [20:00:27] time to deploy parsoid ... [20:02:47] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [20:03:26] !log ssastry@tin Started deploy [parsoid/deploy@323cebb]: Updating Parsoid to 75debae3 [20:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:15] no mobileapps deploy today [20:07:37] bearND: thanks for the heads up [20:07:52] !log planet2001 - activating all the crons, making planet active/active eqiad/codfw [20:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:57] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [20:11:35] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3176312 (10RobH) p:05High>03Normal a:05faidon>03ayounsi I chatted with @ayounsi about this via IRC. 
He is now aware of this pending task, though it isn't high priority. Basically codfw ha... [20:12:24] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3176317 (10RobH) [20:12:42] !log ssastry@tin Finished deploy [parsoid/deploy@323cebb]: Updating Parsoid to 75debae3 (duration: 09m 16s) [20:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:02] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#970935 (10RobH) [20:13:24] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#970935 (10RobH) [20:16:51] (03PS17) 10Andrew Bogott: wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) [20:18:31] (03PS1) 10Thcipriani: Scap: Remove git_server from scap.cfg [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/347924 (https://phabricator.wikimedia.org/T162814) [20:19:06] jouncebot: now [20:19:07] For the next 0 hour(s) and 40 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T2000) [20:19:07] For the next 0 hour(s) and 40 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T1900) [20:20:19] (03CR) 10Andrew Bogott: [C: 032] wmfkeystonehooks: Create project page on wikitech on project creation [puppet] - 10https://gerrit.wikimedia.org/r/323117 (https://phabricator.wikimedia.org/T150091) (owner: 10Andrew Bogott) [20:21:14] (03PS1) 10BBlack: LVS: remove direct use of acamar recdns [puppet] - 10https://gerrit.wikimedia.org/r/347932 (https://phabricator.wikimedia.org/T155411) [20:21:38] (03CR) 10BBlack: [V: 032 C: 032] LVS: remove direct use of acamar recdns [puppet] - 10https://gerrit.wikimedia.org/r/347932 
(https://phabricator.wikimedia.org/T155411) (owner: 10BBlack) [20:21:43] (03PS2) 10BBlack: LVS: remove direct use of acamar recdns [puppet] - 10https://gerrit.wikimedia.org/r/347932 (https://phabricator.wikimedia.org/T155411) [20:21:55] (03CR) 10BBlack: [V: 032 C: 032] LVS: remove direct use of acamar recdns [puppet] - 10https://gerrit.wikimedia.org/r/347932 (https://phabricator.wikimedia.org/T155411) (owner: 10BBlack) [20:23:43] (03PS3) 10BBlack: acamar/achernar -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/347843 (https://phabricator.wikimedia.org/T155411) [20:23:50] (03CR) 10BBlack: [V: 032 C: 032] acamar/achernar -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/347843 (https://phabricator.wikimedia.org/T155411) (owner: 10BBlack) [20:25:30] !log planet2001 - manually updating all feeds to make it active (or would have to wait for crons) [20:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:22] when possible can someone answer this: When changing a namespace per request at phab do i do it in mw/core i18n file for that lang and thats it or is there more i need to do? 
[20:32:04] (03PS2) 10Legoktm: Deploy Linter to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609) [20:40:41] (03CR) 10Legoktm: [C: 032] Deploy Linter to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [20:42:14] (03Merged) 10jenkins-bot: Deploy Linter to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [20:42:27] (03CR) 10jenkins-bot: Deploy Linter to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347439 (https://phabricator.wikimedia.org/T148609) (owner: 10Legoktm) [20:42:29] (03CR) 10Dzahn: [C: 032] site/ganglia: move firewall/standard/IPv6 to role [puppet] - 10https://gerrit.wikimedia.org/r/347876 (owner: 10Dzahn) [20:42:32] !log bblack@neodymium conftool action : set/pooled=no; selector: name=acamar.wikimedia.org,dc=codfw,cluster=dns,service=pdns_recursor [20:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:07] (03PS2) 10Dzahn: site/thumbor: remove duplicate base::firewall include [puppet] - 10https://gerrit.wikimedia.org/r/347873 [20:44:04] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Deploy Linter to all wikis - T148609 (duration: 00m 44s) [20:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:12] T148609: Review and deploy Linter extension to Wikimedia wikis - https://phabricator.wikimedia.org/T148609 [20:45:00] (03CR) 10Dzahn: [C: 032] site/thumbor: remove duplicate base::firewall include [puppet] - 10https://gerrit.wikimedia.org/r/347873 (owner: 10Dzahn) [20:45:07] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:45:12] (03PS2) 10Dzahn: site/ganglia: move firewall/standard/IPv6 to role [puppet] - 10https://gerrit.wikimedia.org/r/347876 [20:45:23] (03CR) 10Mobrovac: [C: 031] service::node - add a defined() guard on git deployment [puppet] - 10https://gerrit.wikimedia.org/r/347855 (owner: 10Gehel) [20:46:33] 06Operations, 13Patch-For-Review: Reimage achernar and acamar to jessie - https://phabricator.wikimedia.org/T155411#3176435 (10BBlack) [20:48:07] (03PS2) 10Dzahn: site/redis_slave: remove duplicate base::firewall include [puppet] - 10https://gerrit.wikimedia.org/r/347881 [20:52:58] 06Operations, 13Patch-For-Review: Reimage achernar and acamar to jessie - https://phabricator.wikimedia.org/T155411#2942665 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['acamar.wikimedia.org'] ``` The log can be found in `/var/log/wmf-auto-reima... [20:54:37] (03PS1) 10Andrew Bogott: wmfkeystonehooks: Fixup for project creation [puppet] - 10https://gerrit.wikimedia.org/r/347975 [20:55:27] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:1:208:80:153:12) [20:55:37] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:47] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: http status 500 [20:57:21] the 208.80.153.12 stuff above is acamar being reinstalled. 
I'm not sure why it has duplicate icinga monitors under its raw IP addresses [20:57:36] (the ones for its real hostname are downtimed) [20:57:57] (03CR) 10Andrew Bogott: [C: 032] wmfkeystonehooks: Fixup for project creation [puppet] - 10https://gerrit.wikimedia.org/r/347975 (owner: 10Andrew Bogott) [21:00:47] RECOVERY - Host 208.80.153.12 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [21:01:48] (03PS3) 10Jforrester: [WIP] Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) [21:02:25] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:02:57] PROBLEM - Recursive DNS on 208.80.153.12 is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:04:12] ah it's intentional in the puppet manifests it seems [21:04:31] still, seems like there's a better way to do that, which keeps it underneath the "acamar" host entry [21:07:27] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100% [21:08:11] bblack: or even have it use the same msg as the one with hostname rather than raw [21:10:53] it's for actual DNS resolution check with check_dns [21:11:26] and the monitoring::service needs to belong to some host so there is a virtual host added.. which is that IP... 
[21:11:30] looking how to avoid it [21:12:00] (03PS2) 10Thcipriani: Scap: Remove git_server from scap.cfg [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/347924 (https://phabricator.wikimedia.org/T162814) [21:12:37] RECOVERY - Host 208.80.153.12 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [21:13:07] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:13:19] 06Operations, 10DNS, 10Traffic, 06Services (watching): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3176556 (10mobrovac) [21:13:54] it's because it wants to monitor both IPv4 and IPv6 and a single host doesnt have 2 IPs from Icinga point of view [21:15:23] surely that can be hacked around with an icinga cmd template that templates out the -I part separately or something [21:15:32] I donno [21:20:08] i think i found something. we can let Icinga know that this host is a "child" of the other host [21:20:40] https://docs.icinga.com/latest/en/networkreachability.html#parentchildrelations [21:23:31] mutante thats the latest docs [21:23:33] for icinga 2.x [21:23:35] not 1.x [21:24:12] paladox: no, that's 1.x [21:24:25] paladox: see on https://docs.icinga.com/ lower right corner [21:24:32] oh i see now [21:24:34] "latest" is still 1 [21:24:42] hmm strange that they call latest 1.x [21:25:15] probably to point out how much of a breaking change it is [21:25:20] and that it's not compatible [21:25:31] oh [21:27:35] (03CR) 10Thcipriani: [C: 04-1] "Needs I27ed2e8989db8b45e7b3397d37b961065f606bee to merge first" [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/347924 (https://phabricator.wikimedia.org/T162814) (owner: 10Thcipriani) [21:28:29] (03PS1) 10Dzahn: dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984 [21:28:43] PROBLEM - Host 2620:0:860:1:d6ae:52ff:feac:4dc8 is DOWN: PING CRITICAL 
- Packet loss = 100% [21:29:31] (03CR) 10jerkins-bot: [V: 04-1] dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984 (owner: 10Dzahn) [21:29:43] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:30:35] (03PS1) 10Dereckson: Enable Education Program on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347987 (https://phabricator.wikimedia.org/T162692) [21:31:55] !log Create Education Program tables on it.wikiversity (T162692) [21:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:04] T162692: Install Extension:Education_Program on it.wikiversity - https://phabricator.wikimedia.org/T162692 [21:34:43] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:39:53] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:31] (03PS2) 10Dzahn: dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984 [21:43:16] (03PS1) 10Dereckson: Document Education Program task reference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347989 [21:44:17] jouncebot: refresh [21:44:21] I refreshed my knowledge about deployments. [21:45:03] RECOVERY - Host 208.80.153.12 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms [21:45:34] 06Operations, 13Patch-For-Review: Reimage achernar and acamar to jessie - https://phabricator.wikimedia.org/T155411#3176636 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['acamar.wikimedia.org'] ``` and were **ALL** successful. 
[21:47:07] (03PS3) 10Dzahn: dnsrec/icinga: add child/parent rel between monitor hosts [puppet] - 10https://gerrit.wikimedia.org/r/347984 [21:48:04] RECOVERY - Recursive DNS on 208.80.153.12 is OK: DNS OK: 0.085 seconds response time. www.wikipedia.org returns 208.80.154.224 [21:48:33] (03PS2) 10Volans: Move varnish puppet disabling in t00 [switchdc] - 10https://gerrit.wikimedia.org/r/347869 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [21:50:05] (03PS3) 10Dzahn: site/redis_slave: remove duplicate base::firewall include [puppet] - 10https://gerrit.wikimedia.org/r/347881 [21:53:27] (03PS1) 10Volans: Use a generic retry for the read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) [21:54:42] (03CR) 10Dzahn: Move varnish puppet disabling in t00 (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/347869 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [22:03:10] (03CR) 10Jforrester: [C: 04-1] Use a generic retry for the read only message (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [22:03:13] (03CR) 10Dzahn: [C: 032] site/redis_slave: remove duplicate base::firewall include [puppet] - 10https://gerrit.wikimedia.org/r/347881 (owner: 10Dzahn) [22:04:59] (03PS1) 10Jcrespo: Revert "mysql-predump.erb: Reduce the number of jobs" [puppet] - 10https://gerrit.wikimedia.org/r/347996 [22:05:22] (03PS2) 10Volans: Use a generic retry for the read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) [22:07:04] (03CR) 10Volans: Use a generic retry for the read only message (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [22:07:32] (03CR) 10Jforrester: [C: 031] Use a generic retry for the read only message 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [22:07:58] (03PS2) 10Dzahn: site/install: unify install nodes into a single regex [puppet] - 10https://gerrit.wikimedia.org/r/347862 [22:08:07] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/6146/" [puppet] - 10https://gerrit.wikimedia.org/r/347862 (owner: 10Dzahn) [22:09:12] (03PS3) 10Volans: Change the RO message we match [switchdc] - 10https://gerrit.wikimedia.org/r/347868 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [22:11:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:16:45] (03CR) 10Dzahn: [C: 032] site: remove all the "$cluster = 'misc'" [puppet] - 10https://gerrit.wikimedia.org/r/347867 (owner: 10Dzahn) [22:16:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:19:34] (03CR) 10Krinkle: [C: 031] Use a generic retry for the read only message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347992 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [22:20:10] (03CR) 10Krinkle: [C: 031] Change the RO message we match [switchdc] - 10https://gerrit.wikimedia.org/r/347868 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [22:21:19] (03PS2) 10Dzahn: site: remove all the "$cluster = 'misc'" [puppet] - 10https://gerrit.wikimedia.org/r/347867 [22:24:55] (03CR) 10Dzahn: [C: 032] site: remove all the "$cluster = 'misc'" [puppet] - 10https://gerrit.wikimedia.org/r/347867 (owner: 10Dzahn) [22:29:04] (03PS3) 10Volans: Move varnish puppet disabling in t00 [switchdc] - 10https://gerrit.wikimedia.org/r/347869 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [22:29:42] (03PS1) 10BBlack: Revert "LVS: remove direct use of 
acamar recdns" [puppet] - 10https://gerrit.wikimedia.org/r/348000 (https://phabricator.wikimedia.org/T155411) [22:29:44] (03PS1) 10BBlack: LVS: remove direct use of achernar recdns [puppet] - 10https://gerrit.wikimedia.org/r/348001 (https://phabricator.wikimedia.org/T155411) [22:30:08] (03CR) 10Dzahn: "no changes on bast1001, uranium (ganglia) or einsteinium (icinga)" [puppet] - 10https://gerrit.wikimedia.org/r/347867 (owner: 10Dzahn) [22:33:20] (03CR) 10Volans: Move varnish puppet disabling in t00 (031 comment) [switchdc] - 10https://gerrit.wikimedia.org/r/347869 (https://phabricator.wikimedia.org/T160178) (owner: 10Giuseppe Lavagetto) [22:37:35] (03CR) 10Dzahn: "yea, we found this today. the icinga-common package _does_ provide /etc/init.d/icinga nowadays. so good point per the FIXME which said the" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [22:40:35] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3176850 (10DStrine) [22:41:21] (03CR) 10Dzahn: "so as you can see there are some differences and these need careful checking. for example we would not have "PURGESCRIPT="/usr/local/sbin/" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [22:42:16] (03CR) 10Dzahn: [C: 04-1] "i'm putting a -1 because it should not be merged like this, but it would still be a nice thing to fix the FIXME of course, if we can" [puppet] - 10https://gerrit.wikimedia.org/r/347899 (owner: 10Paladox) [22:42:33] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 07HTTPS, 07JavaScript: Use Upgrade Insecure Requests on Wikimedia wikis - https://phabricator.wikimedia.org/T101002#3176879 (10Krinkle) >>! In T101002#2500137, @BBlack wrote: >>>! In T101002#1326438, @Krinkle wrote: >> This header currently results... 
[22:43:49] (03PS1) 10Awight: Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002 [22:45:46] !log downtiming acamar again to fixup bios stuff (HT at least) [22:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:39] (03CR) 10jerkins-bot: [V: 04-1] Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002 (owner: 10Awight) [22:47:04] (03PS3) 10Jforrester: Enable wgCiteResponsiveReferences on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344722 (https://phabricator.wikimedia.org/T161307) [22:47:22] (03PS2) 10Jforrester: Enable wgCiteResponsiveReferences on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346333 (https://phabricator.wikimedia.org/T162145) [22:48:28] PROBLEM - Host 208.80.153.12 is DOWN: PING CRITICAL - Packet loss = 100% [22:48:38] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:1:208:80:153:12) [22:49:34] (03PS1) 10Niharika29: Update the LoginNotify config to match what would be going into prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348003 (https://phabricator.wikimedia.org/T162104) [22:50:55] !log acamar fixed up BIOS: HT disabled and power mgmt was set to PPW (DAPC) instead of PPW (OS) [22:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:18] RECOVERY - Host 208.80.153.12 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [22:53:18] RECOVERY - Host 2620:0:860:1:208:80:153:12 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [22:55:08] PROBLEM - NTP peers on acamar is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [22:57:08] (03PS2) 10Jforrester: Increase default thumbnail display size from 220px to 300px [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154408 [22:57:41] (03CR) 10Jforrester: [C: 04-2] "PS2: Manual rebase; however, this needs 
stewardship from Reading." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154408 (owner: 10Jforrester) [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T2300). Please do the needful. [23:00:05] Jdlrobson, MatmaRex, and James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:07] 06Operations, 10DNS, 10Traffic, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3176931 (10GWicke) [23:00:30] Heya. [23:00:34] * jdlrobson is here [23:00:44] hi [23:00:52] Hello [23:00:55] I can SWAT [23:01:17] (03CR) 10BBlack: [C: 032] Revert "LVS: remove direct use of acamar recdns" [puppet] - 10https://gerrit.wikimedia.org/r/348000 (https://phabricator.wikimedia.org/T155411) (owner: 10BBlack) [23:01:24] (03CR) 10BBlack: [C: 032] LVS: remove direct use of achernar recdns [puppet] - 10https://gerrit.wikimedia.org/r/348001 (https://phabricator.wikimedia.org/T155411) (owner: 10BBlack) [23:01:33] (03CR) 10Catrope: [C: 032] Update the LoginNotify config to match what would be going into prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348003 (https://phabricator.wikimedia.org/T162104) (owner: 10Niharika29) [23:02:14] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org,dc=codfw,cluster=dns,service=pdns_recursor [23:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:48] (03Merged) 10jenkins-bot: Update the LoginNotify config to match what would be going into prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348003 (https://phabricator.wikimedia.org/T162104) (owner: 10Niharika29) [23:03:04] (03CR) 10jenkins-bot: Update the LoginNotify 
config to match what would be going into prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348003 (https://phabricator.wikimedia.org/T162104) (owner: 10Niharika29) [23:03:07] (03CR) 10Catrope: [C: 032] Enable wgCiteResponsiveReferences on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344722 (https://phabricator.wikimedia.org/T161307) (owner: 10Jforrester) [23:03:09] (03CR) 10Catrope: [C: 032] Enable wgCiteResponsiveReferences on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346333 (https://phabricator.wikimedia.org/T162145) (owner: 10Jforrester) [23:03:14] (03PS2) 10Catrope: Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester) [23:03:22] (03CR) 10Catrope: [C: 032] Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester) [23:04:19] PROBLEM - NTP peers on acamar is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:57] (03PS2) 10Dzahn: site/prometheus: rm duplicate base::firewall, mv standard to role [puppet] - 10https://gerrit.wikimedia.org/r/347883 [23:05:29] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:05:57] Why didn't jouncebot ping me... [23:06:18] jouncebot: now [23:06:18] For the next 0 hour(s) and 53 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170412T2300) [23:06:26] Niharika: It seems like it refreshes infrequently. It also didn't ping me, but I added myself 2 mins before 4pm [23:06:44] Niharika: Also your thing is merged and is labs-only, so it should arrive in labs soonish through the automatic scap thing [23:06:47] Ah, I see. [23:06:52] is it "max 3 people"? [23:06:56] No. 
[23:07:00] ok [23:08:10] (03Merged) 10jenkins-bot: Enable wgCiteResponsiveReferences on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344722 (https://phabricator.wikimedia.org/T161307) (owner: 10Jforrester) [23:08:31] (03Merged) 10jenkins-bot: Enable wgCiteResponsiveReferences on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346333 (https://phabricator.wikimedia.org/T162145) (owner: 10Jforrester) [23:09:08] RoanKattouw thanks! [23:12:08] RECOVERY - NTP peers on acamar is OK: NTP OK: Offset 0.021441 secs [23:12:28] 06Operations, 13Patch-For-Review: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#3176957 (10Dzahn) 05Open>03Resolved [23:12:30] 06Operations, 06Release-Engineering-Team, 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review: Understand the preparedness of misc services for datacenter switchover - https://phabricator.wikimedia.org/T156937#3176958 (10Dzahn) [23:13:52] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable wgCiteResponsiveReferences on cawiki (T161307) and bgwiki (T162145) (duration: 00m 44s) [23:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:01] T162145: Convert reference lists over to `responsive` on bgwiki - https://phabricator.wikimedia.org/T162145 [23:14:02] T161307: Convert reference lists over to `responsive` on ca.wiki - https://phabricator.wikimedia.org/T161307 [23:14:50] (03PS3) 10Catrope: Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester) [23:14:56] (03CR) 10Catrope: Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester) [23:15:01] (03CR) 10Catrope: [C: 032] Wikitech: Don't try to do cross-wiki uploads to 
Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester) [23:15:40] !log bblack@neodymium conftool action : set/pooled=no; selector: name=achernar.wikimedia.org,dc=codfw,cluster=dns,service=pdns_recursor [23:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:13] (03Merged) 10jenkins-bot: Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester) [23:20:32] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Disable cross-wiki uploads to Commons (T162374) (duration: 00m 43s) [23:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:42] T162374: Poor in-editor (WikiEditor & VE) upload disabled error message on wikitechwiki - https://phabricator.wikimedia.org/T162374 [23:20:43] 06Operations, 13Patch-For-Review: Reimage achernar and acamar to jessie - https://phabricator.wikimedia.org/T155411#3177039 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['achernar.wikimedia.org'] ``` The log can be found in `/var/log/wmf-auto-rei... 
[23:21:03] (03CR) 10jenkins-bot: Enable wgCiteResponsiveReferences on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344722 (https://phabricator.wikimedia.org/T161307) (owner: 10Jforrester) [23:21:05] (03CR) 10jenkins-bot: Enable wgCiteResponsiveReferences on bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346333 (https://phabricator.wikimedia.org/T162145) (owner: 10Jforrester) [23:21:07] (03CR) 10jenkins-bot: Wikitech: Don't try to do cross-wiki uploads to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347422 (https://phabricator.wikimedia.org/T162374) (owner: 10Jforrester) [23:22:58] PROBLEM - Host 2620:0:860:2:208:80:153:42 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:38] PROBLEM - Host 208.80.153.42 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:58] that's the reimaging [23:24:43] this time I fixed bios settings during the reimager's reboot :) [23:24:51] heh :) [23:25:03] 06Operations, 10DNS, 10Traffic, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3177050 (10GWicke) There are big drops in *action* API backend requests and huge spikes in latency around both times: {F7514011} The same latenc... [23:27:24] MatmaRex, jdlrobson: Your changes are live on mwdebug1002, please test [23:28:44] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [23:28:51] RoanKattouw: looks good [23:29:24] RoanKattouw: mine looks good too (on enwiki) [23:30:54] 06Operations, 10DNS, 10Traffic, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3177056 (10BBlack) FWIW - I did the same depooling (for reinstalls) in codfw this afternoon, and there was no impact in that case. So this seems... 
[23:31:14] PROBLEM - Recursive DNS on 208.80.153.42 is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:32:13] !log catrope@tin Synchronized php-1.29.0-wmf.19/resources/src/mediawiki.widgets: Fix setDisabled in mw.widgets.Complex* (T162667) (duration: 00m 44s) [23:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:22] T162667: Special:Move throws "TypeError: this.namespace is undefined" - https://phabricator.wikimedia.org/T162667 [23:33:19] (03PS1) 10Dzahn: import kmod puppet module to manage kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/348009 [23:33:34] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [23:34:02] jdlrobson, MatmaRex: Deployed everywhere now [23:34:10] Whoops, that's a lie [23:34:17] MatmaRex: Your thing is done for wmf19, doing yours in wmf20 now [23:34:20] jdlrobson: And then yours is next [23:35:13] !log catrope@tin Synchronized php-1.29.0-wmf.20/resources/src/mediawiki.widgets: Fix setDisabled in mw.widgets.Complex* (T162667) (duration: 00m 42s) [23:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:57] (03CR) 10jerkins-bot: [V: 04-1] import kmod puppet module to manage kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/348009 (owner: 10Dzahn) [23:36:07] 06Operations: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177068 (10BBlack) [23:36:22] (03PS1) 10Awight: Prune nonexistent config files [dumps] - 10https://gerrit.wikimedia.org/r/348011 [23:36:25] 06Operations: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177084 (10BBlack) p:05Triage>03High [23:37:18] !log catrope@tin Synchronized php-1.29.0-wmf.20/extensions/MobileFrontend/: Log only infoboxes which are not a direct children of lead section (T149884) (duration: 01m 05s) [23:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:26] T149884: Log 
instances of infoboxes being wrapped in containers - https://phabricator.wikimedia.org/T149884 [23:37:30] (CR) jerkins-bot: [V: -1] Prune nonexistent config files [dumps] - https://gerrit.wikimedia.org/r/348011 (owner: Awight) [23:37:32] MatmaRex, jdlrobson Actually done for real now [23:37:41] lol, when you import an official puppet module from the puppet forge, but our linter hates it [23:37:51] rubylint that is [23:37:54] rubocop [23:38:51] Operations: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177086 (BBlack) modules/base/files/kernel/blacklist-wmf.conf is probably the place to try disabling this first, FWIW. [23:39:04] RoanKattouw: thanks [23:40:08] (CR) Dzahn: "yea, rubocop hates it, but it's 100% as imported from the forge.. and i wasn't planning on changing it first. meh" [puppet] - https://gerrit.wikimedia.org/r/348009 (owner: Dzahn) [23:40:33] thanks RoanKattouw ! [23:42:05] (Abandoned) Dzahn: import kmod puppet module to manage kernel modules [puppet] - https://gerrit.wikimedia.org/r/348009 (owner: Dzahn) [23:42:16] !log catrope@tin Started scap: Split RCFilters GuidedTour messages for ORES vs non-ORES (T162693) [23:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:24] T162693: Guided tour for the New Filters mentions ORES predictions on wikis where they are not available - https://phabricator.wikimedia.org/T162693 [23:43:09] Operations: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177068 (Dzahn) i wanted to suggest importing the puppet module kmod (https://gerrit.wikimedia.org/r/#/c/348009/) to get kmod::blacklist, but as you point out we already have the blacklist above so i abandoned that again [23:43:13] (PS2) Awight: Split out retrieving globals and use a more machine-readable format [dumps] - https://gerrit.wikimedia.org/r/348002 [23:43:15] (PS2) Awight: Prune nonexistent config files [dumps] - https://gerrit.wikimedia.org/r/348011
[23:44:53] Operations: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3177093 (Dzahn) [23:54:48] (PS1) Dzahn: base: blacklist acpi_pad kernel module [puppet] - https://gerrit.wikimedia.org/r/348016 [23:56:06] (CR) jerkins-bot: [V: -1] base: blacklist acpi_pad kernel module [puppet] - https://gerrit.wikimedia.org/r/348016 (owner: Dzahn) [23:59:13] PROBLEM - Host 2620:0:860:2:d6ae:52ff:fead:5610 is DOWN: PING CRITICAL - Packet loss = 100% [23:59:58] (CR) Awight: last page range for page content job would sometimes have too many revs (2 comments) [dumps] - https://gerrit.wikimedia.org/r/347627 (owner: ArielGlenn)