[00:01:36] <wikibugs>	 (PS1) Bstorm: cloudstore: add to the script for syncserver [puppet] - https://gerrit.wikimedia.org/r/507227 (https://phabricator.wikimedia.org/T209527)
[00:02:43] <wikibugs>	 (CR) Bstorm: [C: +2] cloudstore: add to the script for syncserver [puppet] - https://gerrit.wikimedia.org/r/507227 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:06:38] <wikibugs>	 (PS3) CDanis: icinga: pause nsca on reloads [puppet] - https://gerrit.wikimedia.org/r/504898 (https://phabricator.wikimedia.org/T196336)
[00:07:19] <wikibugs>	 (CR) CDanis: [C: +2] icinga: pause nsca on reloads [puppet] - https://gerrit.wikimedia.org/r/504898 (https://phabricator.wikimedia.org/T196336) (owner: CDanis)
[00:12:51] <wikibugs>	 (PS1) Bstorm: cloudstore: add to role for the syncing [puppet] - https://gerrit.wikimedia.org/r/507229 (https://phabricator.wikimedia.org/T209527)
[00:13:44] <wikibugs>	 (PS1) Dzahn: admins: add Harumi Monroy to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/507230 (https://phabricator.wikimedia.org/T222110)
[00:14:03] <wikibugs>	 (PS1) Ayounsi: Add fake ssh keys for netbox network user [labs/private] - https://gerrit.wikimedia.org/r/507231
[00:14:08] <wikibugs>	 (CR) Bstorm: [C: +2] cloudstore: add to role for the syncing [puppet] - https://gerrit.wikimedia.org/r/507229 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:14:13] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 101 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, number_of_in_flight_fetch: 0, number_of_data_nodes: 4, timed_out: False, unassigned_shards: 101, number_of_pending_tasks: 0, relocating_shards: 0, initializing_shards: 0, number_of_nodes: 4, status: red, delayed_unassigned_shards: 0, act
[00:14:13] <icinga-wm>	 t_as_number: 82.91032148900169, active_primary_shards: 182, active_shards: 490, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:14:17] <wikibugs>	 (CR) BryanDavis: wikitech: Disable Gerrit accounts when blocked on wikitech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[00:14:29] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 101 threshold =0.15 breach: delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, active_primary_shards: 182, status: red, timed_out: False, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 4, number_of_nodes: 4, number_of_in_flight_fetch: 0, initiali
[00:14:29] <icinga-wm>	 ctive_shards: 490, relocating_shards: 0, unassigned_shards: 101, active_shards_percent_as_number: 82.91032148900169 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:16:23] <wikibugs>	 (CR) Dzahn: [C: +2] admins: add Harumi Monroy to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/507230 (https://phabricator.wikimedia.org/T222110) (owner: Dzahn)
[00:16:31] <wikibugs>	 (PS2) Dzahn: admins: add Harumi Monroy to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/507230 (https://phabricator.wikimedia.org/T222110)
[00:17:04] <wikibugs>	 (PS1) Bstorm: cloudstore: fix one more mistake in syncserver [puppet] - https://gerrit.wikimedia.org/r/507232 (https://phabricator.wikimedia.org/T209527)
[00:17:07] <ebernhardson>	 cloudelastic alerts are expected, it's in the middle of creating all the indices for all the wikis
[00:17:59] <wikibugs>	 (PS2) Ayounsi: Netbox: add j2nb support [puppet] - https://gerrit.wikimedia.org/r/507224
[00:18:48] <wikibugs>	 (PS2) Bstorm: cloudstore: fix one more mistake in syncserver [puppet] - https://gerrit.wikimedia.org/r/507232 (https://phabricator.wikimedia.org/T209527)
[00:19:24] <wikibugs>	 (CR) Bstorm: [C: +2] cloudstore: fix one more mistake in syncserver [puppet] - https://gerrit.wikimedia.org/r/507232 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:21:49] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: timed_out: False, number_of_nodes: 4, active_shards_percent_as_number: 85.3568800588668, relocating_shards: 0, status: red, active_primary_shards: 422, cluster_name: cloudelastic-chi-eqiad, active_shards: 1160, number_of_data_nodes: 4, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0
[00:21:49] <icinga-wm>	 ds: 199, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:22:37] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards: 1247, unassigned_shards: 199, initializing_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, number_of_nodes: 4, timed_out: False, status: red, active_primary_shards: 451, active_shards_percent_as_number: 86.23789764868603, number_of_data_nodes: 4, number_of_pending_task
[00:22:37] <icinga-wm>	 e: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
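Note on the cloudelastic alerts above: the check appears to key on the fraction of inactive shards (the "threshold =0.15" in the alert text) rather than on the cluster `status` field, which would explain why both RECOVERY messages still report `status: red`. A minimal Python sketch of that arithmetic, using the shard counts from the log. The function names and the strict `>` comparison are assumptions for illustration, not the actual Icinga plugin.

```python
# Threshold quoted in the alert text ("threshold =0.15").
# Assumption: the check goes CRITICAL when the inactive fraction is strictly above it.
INACTIVE_THRESHOLD = 0.15


def inactive_ratio(active_shards, unassigned_shards, initializing_shards=0):
    """Fraction of shards that are not active (unassigned or still initializing)."""
    inactive = unassigned_shards + initializing_shards
    total = active_shards + inactive
    return inactive / total if total else 0.0


def classify(active_shards, unassigned_shards, initializing_shards=0):
    """Hypothetical helper: map shard counts to an Icinga-style state."""
    ratio = inactive_ratio(active_shards, unassigned_shards, initializing_shards)
    return "CRITICAL" if ratio > INACTIVE_THRESHOLD else "OK"


# (timestamp, active_shards, unassigned_shards) taken from the log lines above.
samples = [
    ("00:14:13", 490, 101),
    ("00:21:49", 1160, 199),
    ("00:22:37", 1247, 199),
]

for timestamp, active, unassigned in samples:
    ratio = inactive_ratio(active, unassigned)
    print(timestamp, f"inactive={ratio:.4f}", classify(active, unassigned))
```

With the logged counts, 101 of 591 shards inactive (about 0.171) is above the threshold, while 199 of 1359 (about 0.146) and 199 of 1446 (about 0.138) are below it, matching the PROBLEM and RECOVERY transitions even though index creation was still in progress.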
[00:50:16] <wikibugs>	 Operations, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Dzahn)
[00:57:24] <wikibugs>	 Operations, Phabricator, Project-Admins, SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc - https://phabricator.wikimedia.org/T221112 (Dzahn)
[00:58:49] <wikibugs>	 Operations, Phabricator, Project-Admins, SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc - https://phabricator.wikimedia.org/T221112 (Dzahn) >>! In T221112#5121789, @mmodell wrote: @Dzahn can you help me figure out how to allow @aklapper to run...
[01:02:57] <wikibugs>	 Operations, Phabricator, Project-Admins, SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (Dzahn)
[01:12:59] <wikibugs>	 (CR) Dzahn: [C: +1] Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: Aklapper)
[01:22:33] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Dzahn) When adding a new type of server name please add them in these 2 places:  - wikitech https://wikitech.wikimedia.org/wiki/In...
[01:25:11] <wikibugs>	 (PS5) Dzahn: zuul: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507070 (https://phabricator.wikimedia.org/T218844) (owner: Paladox)
[01:27:15] <wikibugs>	 (CR) Dzahn: [C: +2] zuul: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507070 (https://phabricator.wikimedia.org/T218844) (owner: Paladox)
[01:30:00] <mutante>	 !log contint2001..then contint1001 - deleting /etc/zuul/wikimedia and letting puppet re-clone it (gerrit:507070) (T218844)
[01:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:04] <stashbot>	 T218844: Update Gerrit /r/p/ links to /r/ - https://phabricator.wikimedia.org/T218844
[01:35:07] <wikibugs>	 Operations, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Dzahn)
[01:35:32] <wikibugs>	 Operations, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Dzahn)
[01:44:15] <icinga-wm>	 RECOVERY - Host db1093 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[01:46:56] <icinga-wm>	 PROBLEM - mysqld processes on db1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:47:39] <icinga-wm>	 PROBLEM - MariaDB read only s6 on db1093 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:48:00] <icinga-wm>	 PROBLEM - MariaDB Slave IO: s6 on db1093 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:48:04] <mutante>	 uhmm.. i got paged because of that 
[01:48:06] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: s6 on db1093 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:48:15] <mutante>	 but also i think it's normal that they dont start up after reboot
[01:53:35] <cdanis>	 I do believe DBs not starting up on boot is normal and expected
[01:53:43] <wikibugs>	 (PS1) Dzahn: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237
[01:53:53] <mutante>	 yes, it is. but i think we are still supposed to depool it
[01:53:59] <mutante>	 ?
[01:54:16] <shdubsh>	 I would assume so
[01:54:53] <wikibugs>	 (CR) Gergő Tisza: wikitech: Disable Gerrit accounts when blocked on wikitech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[01:54:56] <wikibugs>	 (CR) Cwhite: [C: +1] depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[01:56:44] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s6 on db1093 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:01:36] <wikibugs>	 (CR) Gergő Tisza: [C: +2] depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:02:03] <tgr>	 ^ doing an emergency mw-config deploy
[02:02:06] <mutante>	 thanks tgr!
[02:02:20] <mutante>	 making a DBA ticket for it
[02:02:44] <wikibugs>	 (Merged) jenkins-bot: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:04:34] <tgr>	 mutante: I suppose this is not testable on mwdebug?
[02:06:33] <mutante>	 tgr: no, i don't think so. but we have "Once the change is deployed, we should be able to see our change on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php "
[02:06:59] <mutante>	 well..that is after the fact
[02:07:40] <mutante>	 this is the official example from the docs how a diff should look:
[02:07:41] <mutante>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/447984/1/wmf-config/db-eqiad.php
[02:09:34] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/db-eqiad.php: SWAT: [[gerrit:507237|depool db1093]] (duration: 00m 54s)
[02:09:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:13] <wikibugs>	 (CR) jenkins-bot: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:10:31] <icinga-wm>	 PROBLEM - HP RAID on db1093 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:10:33] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db1093 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T222128 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:10:38] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (ops-monitoring-bot)
[02:10:44] <mutante>	 and that RAID issue is why it went down.. i think
[02:10:55] <mutante>	 there were lots of alerts in SOFT state earlier
[02:11:13] <mutante>	 now it finally went from SOFT to HARD
[02:12:23] <mutante>	 i DO see the changes now on https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php  confirmed
[02:13:49] <tgr>	 connection errors are gone
[02:13:55] <mutante>	 tgr: :) thanks
[02:13:56] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Dzahn) This host started paging us for being rebooted a little before this ticket has been created. It is already depooled. -> T222127
[02:15:20] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave IO: s6 on db1093 is CRITICAL: CRITICAL slave_io_state could not connect daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:15:21] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on db1093 is CRITICAL: CRITICAL slave_sql_lag could not connect daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:15:22] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on db1093 is CRITICAL: CRITICAL slave_sql_state could not connect daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:15:22] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB read only s6 on db1093 is CRITICAL: Could not connect to localhost:3306 daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:15:23] <icinga-wm>	 ACKNOWLEDGEMENT - mysqld processes on db1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:16:41] <wikibugs>	 Operations, ops-eqiad, DBA: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Dzahn)
[02:17:56] <wikibugs>	 (CR) Dzahn: "https://phabricator.wikimedia.org/T222127 , https://phabricator.wikimedia.org/T222128" [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:19:43] <wikibugs>	 Operations, ops-eqiad: kafka1023 correctable memory errors - https://phabricator.wikimedia.org/T194249 (Dzahn) showing up in icinga as a new issue:  https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka1023&service=Memory+correctable+errors+-EDAC-
[02:20:04] <icinga-wm>	 ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on kafka1023 is CRITICAL: 4.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T194249 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad+prometheus/ops
[02:21:29] <cdanis>	 back at an actual computer now, anything I can help with shdubsh mutante tgr ?
[02:21:59] <mutante>	 cdanis: thank you, i think it's done. followed the DBA docs to depool it and make a ticket
[02:22:17] <cdanis>	 👍
[02:22:21] <icinga-wm>	 RECOVERY - Check systemd state on analytics1050 is OK: OK - running: The system is fully operational
[02:23:29] <mutante>	 !log analytics1050 - systemctl start mclog ... it was failed like recently on analytics1052 (T212219 ?)
[02:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:34] <stashbot>	 T212219: wmf-auto-restart fails on certain legacy services - https://phabricator.wikimedia.org/T212219
[02:24:02] <mutante>	 and with that Icinga looks clean again and i'm out 
[02:25:03] <mutante>	 cya tomorrow
[02:29:07] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[02:29:15] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[02:30:23] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:31:53] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1166 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:33:01] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:34:29] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.21 ms
[02:34:37] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[02:39:33] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:41:15] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153)
[02:42:11] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:42:35] <wikibugs>	 (CR) Zoranzoki21: [C: +1] Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: Ammarpad)
[02:43:33] <wikibugs>	 (CR) Zoranzoki21: [C: +1] "Should be ok..." [mediawiki-config] - https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: DannyS712)
[02:46:39] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms
[02:50:40] <wikibugs>	 (CR) Zoranzoki21: [C: +1] Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[02:51:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16912696 and 0 seconds
[02:52:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 222896 and 71 seconds
[03:10:39] <icinga-wm>	 PROBLEM - puppet last run on conf1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:26:44] <wikibugs>	 Operations, TechCom, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Backlog (Later), and 6 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (aaron) >>! In T211721#5009838, @aaron wrote: > The SET metric for redis is very slow, so...
[03:42:29] <icinga-wm>	 RECOVERY - puppet last run on conf1005 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[03:50:19] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:55:29] <icinga-wm>	 PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:55:31] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational
[03:55:55] <icinga-wm>	 PROBLEM - puppet last run on analytics1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:03:13] <icinga-wm>	 PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:22:29] <icinga-wm>	 RECOVERY - puppet last run on analytics1070 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:26:25] <mutante>	 !log LDAP - remove user pirroh from group nda (T222085 and cross-validate-accounts demands consistency)
[04:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:29] <stashbot>	 T222085: Revoke @pirroh's shell access - https://phabricator.wikimedia.org/T222085
[04:27:23] <icinga-wm>	 RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:28:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] registryha: feat: introducing LVS configuration [puppet] - https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: Fsero)
[04:30:33] <icinga-wm>	 PROBLEM - puppet last run on rdb1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:31:59] <wikibugs>	 Operations, Wikimedia-Site-requests, acl*stewards: Create accounts for new stewards in closed wikis - https://phabricator.wikimedia.org/T222117 (kolbert) I agree that there should be some established procedure for cases where it becomes apparent action needs to be taken on a closed wiki.   @Base are...
[04:34:26] <wikibugs>	 (PS3) Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946
[04:34:28] <wikibugs>	 (PS2) Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578
[04:34:30] <wikibugs>	 (PS3) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947
[04:34:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[04:35:03] <icinga-wm>	 RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:37:41] <wikibugs>	 (PS4) Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946
[04:37:43] <wikibugs>	 (PS3) Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578
[04:37:45] <wikibugs>	 (PS4) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947
[04:41:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[04:41:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[04:41:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578 (owner: Giuseppe Lavagetto)
[04:44:17] <wikibugs>	 (CR) Giuseppe Lavagetto: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[04:48:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[04:49:13] <_joe_>	 uhm I really don't get this
[04:51:03] <_joe_>	 and this was the rebase, indeed
[04:53:52] <onimisionipe>	 The tests won't run
[04:54:08] <onimisionipe>	 Volans said something about fixing it yesterday
[04:54:46] <onimisionipe>	 Had the same issue last week
[04:57:01] <icinga-wm>	 RECOVERY - puppet last run on rdb1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:58:14] <_joe_>	 how did this happen, I could dig deeper
[05:02:11] <_joe_>	 oh I see now running tests locally
[05:04:52] <wikibugs>	 Operations, ops-eqiad, DBA: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Marostegui) Per that output, looks like the BBU is gone, let's follow the investigation at {T222127}
[05:04:53] <wikibugs>	 (PS1) Marostegui: db1093: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/507241 (https://phabricator.wikimedia.org/T222127)
[05:05:07] <wikibugs>	 Operations, ops-eqiad, DBA: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Marostegui)
[05:08:27] <wikibugs>	 (CR) Marostegui: [C: +2] db1093: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/507241 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:18:12] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui)
[05:22:00] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127)
[05:24:04] <wikibugs>	 (CR) Marostegui: [C: +2] db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:24:37] <wikibugs>	 (PS1) Marostegui: db2045: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507243 (https://phabricator.wikimedia.org/T219493)
[05:25:07] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:26:27] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify db1093's status (duration: 00m 55s)
[05:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:26] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1093's status (duration: 00m 51s)
[05:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:41] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui) I have started MySQL which started correctly. As it started fine, I have started replication too, once it has caught up, I am going to do a da...
[05:35:01] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:38:21] <wikibugs>	 (PS3) Elukey: admin: allow analytics-admins to control jupyter user units [puppet] - https://gerrit.wikimedia.org/r/504067
[05:40:08] <wikibugs>	 (CR) Elukey: [C: +2] admin: allow analytics-admins to control jupyter user units [puppet] - https://gerrit.wikimedia.org/r/504067 (owner: Elukey)
[05:40:27] <wikibugs>	 (CR) Marostegui: [C: +2] db2045: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507243 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[05:40:29] <wikibugs>	 (CR) Elukey: [C: +2] "This has been approved by the SRE team meeting." [puppet] - https://gerrit.wikimedia.org/r/504067 (owner: Elukey)
[05:40:34] <wikibugs>	 (PS2) Marostegui: db2045: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507243 (https://phabricator.wikimedia.org/T219493)
[05:41:38] <marostegui>	 elukey: good to merge your change?
[05:42:30] <elukey>	 yep!
[05:42:33] <elukey>	 thanks :)
[05:42:45] <marostegui>	 merging!
[06:29:53] <icinga-wm>	 PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-puppet-agent]
[06:30:59] <icinga-wm>	 PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:45:56] <wikibugs>	 (PS3) Matthias Mullie: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734)
[06:56:11] <icinga-wm>	 RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:16] <wikibugs>	 Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui) For what is worth, the LB looks like it worked fine. The time line is:  23:24: db1093 goes down 23:24-23:30: Spike of errors and then some res...
[06:57:19] <icinga-wm>	 RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:24:08] <marostegui>	 !log Remove labservices1001 and labservices1002 from tendril T221857 
[07:24:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:12] <stashbot>	 T221857: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857
[07:29:06] <moritzm>	 !log installing systemd updates for jessie
[07:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:29] <wikibugs>	 Operations, SRE-Access-Requests: Allow analytics-admins to control jupyter user units - https://phabricator.wikimedia.org/T222087 (elukey) Open→Resolved p:Triage→Normal
[07:39:45] <wikibugs>	 Operations, Research, SRE-Access-Requests, Patch-For-Review: Revoke @pirroh's shell access - https://phabricator.wikimedia.org/T222085 (MoritzMuehlenhoff) @RStallman-legalteam : The access for Michele Cataste has been removed, can you please also update the NDA tracking meta data?
[07:41:08] <wikibugs>	 (PS1) Elukey: Add the dfs.namenode.handler.count HDFS option [puppet/cdh] - https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702)
[07:45:16] <wikibugs>	 (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16179/an-master1001.eqiad.wmnet/" [puppet/cdh] - https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[07:46:06] <wikibugs>	 Operations, Research, SRE-Access-Requests, Patch-For-Review: Revoke @pirroh's shell access - https://phabricator.wikimedia.org/T222085 (MoritzMuehlenhoff) @Dzahn Best to use https://wikitech.wikimedia.org/wiki/Ops_Offboarding#Remove_user_from_privileged_groups ; it will prepare an LDIF to drop al...
[07:50:06] <wikibugs>	 Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) It seems to be breaking navtiming (coal is fine, though):  ` Apr 30 07:43:37 webperf1001 python[5681]: 2019-04-30 07:43:37,515 [...
[07:53:34] <logmsgbot>	 !log gilles@deploy1001 Started deploy [performance/navtiming@e900152]: T221848 add more logging around startup
[07:53:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:38] <stashbot>	 T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[07:53:39] <logmsgbot>	 !log gilles@deploy1001 Finished deploy [performance/navtiming@e900152]: T221848 add more logging around startup (duration: 00m 05s)
[07:53:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:05] <logmsgbot>	 !log gilles@deploy1001 Started deploy [performance/navtiming@8f135ac]: T221848 Defalt to partition 0 when no partition is found
[08:11:05] <logmsgbot>	 !log gilles@deploy1001 deploy aborted: T221848 Defalt to partition 0 when no partition is found (duration: 00m 00s)
[08:11:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:09] <stashbot>	 T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[08:11:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:16] <logmsgbot>	 !log gilles@deploy1001 Started deploy [performance/navtiming@8f135ac]: T221848 Default to partition 0 when no partition is found
[08:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:21] <logmsgbot>	 !log gilles@deploy1001 Finished deploy [performance/navtiming@8f135ac]: T221848 Default to partition 0 when no partition is found (duration: 00m 05s)
[08:11:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add the dfs.namenode.handler.count HDFS option [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey)
[08:17:02] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Gilles) Fixed navtiming for now, I'll investigate further to make sure that this is a proper fix and not a hack. Right now I'm not sure that the metadat...
[08:18:32] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Pablo-WMDE) @mobrovac During T221755 & T221754 we tended to [[ https://ssr-termbox.wmflabs.org/?spec | `/?spec` ]] and [[ https...
[08:19:04] <wikibugs>	 (03PS1) 10Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:20:30] <wikibugs>	 (03PS2) 10Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:21:47] <wikibugs>	 (03PS3) 10Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:22:36] <godog>	 !log bounce prometheus on bast4002 after backfill has finished - T187987
[08:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:40] <stashbot>	 T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987
[08:30:10] <wikibugs>	 (03PS4) 10Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:31:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Make dfs_namenode_handler_count optional [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey)
[08:32:21] <icinga-wm>	 PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[08:34:45] <wikibugs>	 (03PS1) 10Elukey: hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - 10https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702)
[08:37:50] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16183/" [puppet] - 10https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey)
[08:45:52] <wikibugs>	 (03CR) 10Joal: "One comment on the value, except from that looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey)
[08:50:27] <wikibugs>	 (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey)
[08:51:15] <wikibugs>	 (03PS2) 10Elukey: hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - 10https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702)
[08:52:10] <wikibugs>	 (03PS1) 10Ema: Revert "package_builder: move lintian out of require_package" [puppet] - 10https://gerrit.wikimedia.org/r/507262 (https://phabricator.wikimedia.org/T221784)
[08:52:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: 10M max-samples for all instances [puppet] - 10https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105) (owner: 10CDanis)
[08:55:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "On deployment: puppet will restart prometheus instances due to this, thus we'll need to do a controlled rollout" [puppet] - 10https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105) (owner: 10CDanis)
[08:55:39] <wikibugs>	 (03PS8) 10Fsero: registryha: feat: introducing LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101)
[08:57:07] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Revert "package_builder: move lintian out of require_package" [puppet] - 10https://gerrit.wikimedia.org/r/507262 (https://phabricator.wikimedia.org/T221784) (owner: 10Ema)
[08:57:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - 10https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey)
[08:57:26] <wikibugs>	 (03PS3) 10Elukey: hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - 10https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702)
[08:58:49] <icinga-wm>	 RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:02:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (10Marostegui) The following tables have been checked against multiple hosts and reported no differences: ` archive logging page revision text user change_ta...
[09:02:50] <elukey>	 !log roll restart hdfs namenodes on an-master100[1,2] to pick up new settings - T220702
[09:02:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:54] <stashbot>	 T220702: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702
[09:04:20] <ema>	 volans, moritzm: I've introduced a dependency cycle to test T221784, puppet failed on boron at 08:58. Still no trace of alerts in icinga 
[09:04:21] <stashbot>	 T221784: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784
[09:05:05] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/16185/webperf1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: 10Gilles)
[09:05:15] <volans>	 ema: in the middle of something else, I can have a look in a few
[09:05:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans)
[09:05:28] <ema>	 volans: sure, no rush!
[09:09:57] <moritzm>	 ema: I forced the puppet check and in fact it believes puppet is running fine: "OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures"
[09:10:04] <moritzm>	 but a manual run on boron in fact fails
[09:10:33] <moritzm>	 Puppet even fails to fail properly :-)
[09:11:04] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127)
[09:11:45] <ema>	 moritzm: https://twitter.com/HackerNewsOnion/status/1118542182842085376
[09:11:51] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] cookbook API: add class API (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[09:12:16] <wikibugs>	 (03CR) 10Jcrespo: "Structurally this makes way more sense, I am ok with the philosophy, but need to check it doesn't break anything as firewall changes can b" [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk)
[09:15:19] <wikibugs>	 (03PS1) 10Marostegui: db1093: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127)
[09:15:21] <wikibugs>	 (03CR) 10Marostegui: "jcrespo maybe you want to push this Thursday instead of today given that tomorrow is bank holiday and maybe we want to leave this host run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[09:15:23] <wikibugs>	 (03CR) 10Marostegui: "jcrespo maybe you want to push this Thursday instead of today given that tomorrow is bank holiday and maybe we want to leave this host run" [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[09:15:25] <wikibugs>	 (03PS2) 10Ema: Add profile::cache::varnish::frontend::text [puppet] - 10https://gerrit.wikimedia.org/r/507022 (https://phabricator.wikimedia.org/T219967)
[09:18:28] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Add profile::cache::varnish::frontend::text [puppet] - 10https://gerrit.wikimedia.org/r/507022 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[09:19:38] <wikibugs>	 (03CR) 10Gehel: "minor style issue, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[09:20:57] <wikibugs>	 (03CR) 10Volans: [C: 03+2] setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans)
[09:23:01] <wikibugs>	 (03PS9) 10Fsero: registryha: feat: introducing LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101)
[09:23:53] <wikibugs>	 10Operations, 10cloud-services-team: labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff)
[09:25:39] <wikibugs>	 (03CR) 10Fsero: [C: 03+2] registryha: feat: introducing LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero)
[09:26:02] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans)
[09:26:20] <fsero>	 !log creating lvs endpoints for docker registry - T221101
[09:26:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:24] <stashbot>	 T221101: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101
[09:26:39] <wikibugs>	 10Operations, 10cloud-services-team: labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) Also applies to labpuppetmaster*
[09:26:55] <wikibugs>	 (03CR) 10jenkins-bot: setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans)
[09:26:58] <wikibugs>	 10Operations, 10cloud-services-team: labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff)
[09:31:22] <logmsgbot>	 !log fsero@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=docker-registry,service=docker-registry
[09:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:18] <wikibugs>	 (03PS1) 10Ema: cache: reimage cp4022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967)
[09:33:11] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "A few more comments inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[09:33:46] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "+1 but let's wait a bit." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[09:34:12] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] db1093: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[09:34:55] <wikibugs>	 (03CR) 10Marostegui: "Let's merge this once it gets pooled, no need to page if it is not pooled for now" [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui)
[09:35:59] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.44:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:36:09] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 47 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal
[09:36:16] <jijiki>	 ^ this is expected 
[09:36:31] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.44:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:37:09] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.44:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:37:19] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 47 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal
[09:39:12] <_joe_>	 lvs2006 should recover
[09:39:54] <_joe_>	 I see that ip in ipvsadm
[09:40:59] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 38 connections established with conf2001.codfw.wmnet:2379 (min=39) https://wikitech.wikimedia.org/wiki/PyBal
[09:41:03] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.44:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:42:27] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:43:28] <wikibugs>	 10Operations, 10monitoring, 10Wikimedia-Incident: prometheus: some sort of IRC alerts on restarts? - https://phabricator.wikimedia.org/T222108 (10fgiunchedi) Checking process uptime sounds good to me, if I understood correctly (the one-time icinga notification) the alert would self-recover once uptime is no l...
[09:43:49] <wikibugs>	 (03PS2) 10Ema: cache: reimage cp4022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967)
[09:43:52] <wikibugs>	 (03PS1) 10Ema: cache: add ulsfo_ats to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967)
[09:46:08] <wikibugs>	 10Operations, 10monitoring, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10fgiunchedi) +1! I'm expecting the most effective mitigation to be recording rules, followed by loading less panels
[09:46:37] <wikibugs>	 10Operations, 10cloud-services-team: labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) Same for labstore1004/1005
[09:46:52] <wikibugs>	 10Operations, 10cloud-services-team: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff)
[09:47:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506400 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond)
[09:51:35] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 39 connections established with conf2001.codfw.wmnet:2379 (min=39) https://wikitech.wikimedia.org/wiki/PyBal
[09:51:39] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:51:51] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:52:05] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 48 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal
[09:53:57] <wikibugs>	 (03PS2) 10Muehlenhoff: Ignore libpng for nginx service restarts [puppet] - 10https://gerrit.wikimedia.org/r/507020
[09:54:09] <wikibugs>	 (03PS3) 10Muehlenhoff: Ignore libpng for nginx service restarts [puppet] - 10https://gerrit.wikimedia.org/r/507020
[09:56:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Ignore libpng for nginx service restarts [puppet] - 10https://gerrit.wikimedia.org/r/507020 (owner: 10Muehlenhoff)
[09:57:43] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:57:52] <wikibugs>	 (03PS2) 10Muehlenhoff: Move Kerberos Hiera settings to global setting [puppet] - 10https://gerrit.wikimedia.org/r/506647
[09:58:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:58:27] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 48 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal
[09:58:56] <wikibugs>	 (03PS1) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504)
[09:59:46] <wikibugs>	 (03PS2) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504)
[10:00:04] <apergos>	 bah I hoped I'd ^c before the first one went. oh well
[10:03:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Move Kerberos Hiera settings to global setting [puppet] - 10https://gerrit.wikimedia.org/r/506647 (owner: 10Muehlenhoff)
[10:05:01] <wikibugs>	 (03PS14) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[10:05:03] <wikibugs>	 (03PS12) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203)
[10:05:05] <wikibugs>	 (03PS3) 10Jcrespo: mariadb-backups: Setup db2097 as the source of some codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203)
[10:05:07] <wikibugs>	 (03PS1) 10Jcrespo: mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572)
[10:06:05] <wikibugs>	 (03PS4) 10Jbond: facter3/puppet5: update interface fact parsing [puppet] - 10https://gerrit.wikimedia.org/r/506651 (https://phabricator.wikimedia.org/T219803)
[10:07:00] <wikibugs>	 (03PS5) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803)
[10:08:42] <jynus>	 !log stop s7 and x1 instances on dbstore2* for cloning T220572
[10:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:47] <stashbot>	 T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572
[10:08:53] <wikibugs>	 (03PS2) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967)
[10:10:08] <wikibugs>	 (03PS17) 10Mathew.onipe: elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932)
[10:10:10] <wikibugs>	 (03PS9) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[10:11:12] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: labtestvirt2003: now a spare system [puppet] - 10https://gerrit.wikimedia.org/r/507270 (https://phabricator.wikimedia.org/T222057)
[10:11:38] <wikibugs>	 (03PS1) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271
[10:11:49] <wikibugs>	 (03PS3) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967)
[10:12:04] <godog>	 !log rollout rsyslog upgrade to 8.1901.0-1~bpo9+wmf1 in eqsin / ulsfo / esams
[10:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:11] <wikibugs>	 (03PS2) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271
[10:12:51] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestvirt2003: now a spare system [puppet] - 10https://gerrit.wikimedia.org/r/507270 (https://phabricator.wikimedia.org/T222057) (owner: 10Arturo Borrero Gonzalez)
[10:13:47] <wikibugs>	 (03PS2) 10Jcrespo: mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572)
[10:14:38] <wikibugs>	 (03PS1) 10Jbond: facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272
[10:14:54] <wikibugs>	 (03PS4) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967)
[10:14:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond)
[10:15:05] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo)
[10:15:25] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo)
[10:15:53] <wikibugs>	 (03PS3) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271
[10:16:59] <wikibugs>	 (03CR) 10Jbond: "rebuild" [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond)
[10:17:20] <wikibugs>	 (03PS2) 10Jbond: facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272
[10:17:39] <icinga-wm>	 PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[10:17:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 (owner: 10Fsero)
[10:18:09] <wikibugs>	 (03CR) 10Fsero: [C: 03+2] registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 (owner: 10Fsero)
[10:18:20] <wikibugs>	 (03PS4) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271
[10:19:29] <wikibugs>	 10Operations, 10serviceops, 10User-Elukey: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T221346 (10elukey) Looks sane to me!
[10:20:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond)
[10:20:14] <fsero>	 jynus: is your change good to merge?
[10:20:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond)
[10:20:25] <jynus>	 fsero: yes
[10:20:28] <jynus>	 I got distracted
[10:20:38] <fsero>	 merged
[10:20:38] <wikibugs>	 (03PS3) 10Jbond: facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272
[10:22:35] <wikibugs>	 (03PS5) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967)
[10:22:59] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: labtestservices2003: reimage as spare stretch. [puppet] - 10https://gerrit.wikimedia.org/r/507273 (https://phabricator.wikimedia.org/T222060)
[10:23:11] <wikibugs>	 (03PS18) 10Mathew.onipe: elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932)
[10:23:13] <wikibugs>	 (03PS10) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[10:23:29] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2058435 https://wikitech.wikimedia.org/wiki/Varnish
[10:24:20] <ema>	 looking ^
[10:24:50] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: labtestservices2003: reimage as spare stretch. [puppet] - 10https://gerrit.wikimedia.org/r/507273 (https://phabricator.wikimedia.org/T222060)
[10:26:06] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestservices2003: reimage as spare stretch. [puppet] - 10https://gerrit.wikimedia.org/r/507273 (https://phabricator.wikimedia.org/T222060) (owner: 10Arturo Borrero Gonzalez)
[10:27:27] <ema>	 cp3038's scheduled backend restart is in 2 hours and it's not failing yet, waiting for cron to restart it
[10:28:47] <wikibugs>	 (03CR) 10Mathew.onipe: elasticsearch: config file for aligning puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[10:32:18] <arturo>	 !log T222057 reimaged labtestvirt2003 as spare system
[10:32:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:24] <stashbot>	 T222057: labtestvirt2003.codfw.wmnet: reimage as spare stretch - https://phabricator.wikimedia.org/T222057
[10:32:43] <arturo>	 !log T222060 reimaged labtestservices2003 as stretch spare system
[10:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:47] <stashbot>	 T222060: labtestservices2003.wikimedia.org: reimage as spare stretch - https://phabricator.wikimedia.org/T222060
[10:32:56] <godog>	 !log rollout rsyslog upgrade to 8.1901.0-1~bpo9+wmf1 in codfw
[10:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:24] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational
[10:39:28] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: vagrant: refactor roles into profiles [puppet] - 10https://gerrit.wikimedia.org/r/507005 (https://phabricator.wikimedia.org/T221225)
[10:41:06] <icinga-wm>	 ACKNOWLEDGEMENT - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string \{\} not found on https://docker-registry.svc.codfw.wmnet:443/v2/ - 292 bytes in 0.159 second response time Fsero bad icinga check https://wikitech.wikimedia.org/wiki/Docker-registry-runbook
[10:41:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] vagrant: refactor roles into profiles [puppet] - 10https://gerrit.wikimedia.org/r/507005 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez)
[10:42:22] <wikibugs>	 (03PS6) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803)
[10:43:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[10:44:20] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: cloudvps: introduce proper base role/profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506979 (https://phabricator.wikimedia.org/T221225)
[10:45:37] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging]
[10:45:38] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver cluster staging completed
[10:45:38] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver finished
[10:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16196/" [puppet] - 10https://gerrit.wikimedia.org/r/506979 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez)
[10:48:23] <icinga-wm>	 RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:48:38] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad]
[10:48:39] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver cluster eqiad completed
[10:48:39] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver finished
[10:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:19] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw]
[10:49:20] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver cluster codfw completed
[10:49:20] <logmsgbot>	 !log santhosh@deploy1001 scap-helm cxserver finished
[10:49:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:57] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[10:50:03] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[10:51:01] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[10:51:07] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[10:51:41] <wikibugs>	 (PS1) Volans: prometheus: fix base URL template [software/spicerack] - https://gerrit.wikimedia.org/r/507279
[10:51:43] <wikibugs>	 (PS1) Volans: doc: autodoc missing API modules [software/spicerack] - https://gerrit.wikimedia.org/r/507280
[10:51:45] <wikibugs>	 (PS1) Volans: doc: add checker to ensure modules are documented [software/spicerack] - https://gerrit.wikimedia.org/r/507281
[10:52:38] <wikibugs>	 (PS1) Fsero: registryha,lvs: bug: modifying check LVS [puppet] - https://gerrit.wikimedia.org/r/507282 (https://phabricator.wikimedia.org/T221101)
[10:53:35] <wikibugs>	 (CR) Jbond: [C: +1] "good catch lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/507279 (owner: Volans)
[10:53:44] <wikibugs>	 (CR) Fsero: [C: +2] registryha,lvs: bug: modifying check LVS [puppet] - https://gerrit.wikimedia.org/r/507282 (https://phabricator.wikimedia.org/T221101) (owner: Fsero)
[10:55:22] <kart_>	 !log Updated cxserver to 2019-04-30-055331-production (T219412)
[10:55:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:26] <stashbot>	 T219412: CX2: Do not translate reference contents - https://phabricator.wikimedia.org/T219412
[10:55:34] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[10:56:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] doc: add checker to ensure modules are documented [software/spicerack] - https://gerrit.wikimedia.org/r/507281 (owner: Volans)
[10:56:34] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[10:56:37] <wikibugs>	 (PS6) Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967)
[10:57:14] <wikibugs>	 (PS4) Lucas Werkmeister (WMDE): Serialize empty lists as objects on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104)
[10:57:16] <wikibugs>	 (PS5) Lucas Werkmeister (WMDE): Serialize empty lists as objects on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104)
[10:57:45] <icinga-wm>	 PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:57:52] <wikibugs>	 (PS7) Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967)
[10:58:45] <icinga-wm>	 RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:59:01] <wikibugs>	 (PS2) Volans: doc: add checker to ensure modules are documented [software/spicerack] - https://gerrit.wikimedia.org/r/507281
[10:59:14] <wikibugs>	 (CR) Ema: [C: +2] cache: add ATS nodes to cacheproxy::cron_restart [puppet] - https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[11:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1100).
[11:00:04] <jouncebot>	 Lucas_WMDE and mlitn: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:11] <Lucas_WMDE>	 o/
[11:00:15] <matthiasmullie>	 check
[11:01:00] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 (fsero)
[11:01:01] <Lucas_WMDE>	 can I start with my patches?
[11:01:17] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, and 2 others: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 (fsero)
[11:01:41] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:01:43] <wikibugs>	 (CR) Volans: [C: +2] prometheus: fix base URL template [software/spicerack] - https://gerrit.wikimedia.org/r/507279 (owner: Volans)
[11:01:51] <Lucas_WMDE>	 okay I’m starting
[11:02:37] <matthiasmullie>	 sure
[11:02:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] Serialize empty lists as objects on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:02:54] <ema>	 !log cp3038 mbox lag, restarting varnish-be
[11:02:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:34] <Lucas_WMDE>	 huh https://integration.wikimedia.org/ci/job/operations-mw-config-typos-docker/4346/console
[11:03:42] <Lucas_WMDE>	 oh no
[11:04:02] <Lucas_WMDE>	 not this thing? T222131
[11:04:02] <stashbot>	 T222131: mediawiki-quibble-composertest-php70-docker failure: Unable to find image 'docker-registry.wikimedia.org/releng/castor:0.2.0' locally - https://phabricator.wikimedia.org/T222131
[11:04:38] <wikibugs>	 (CR) Effie Mouzeli: [V: +1 C: +2] Fix coal syslog logging [puppet] - https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: Gilles)
[11:04:49] <wikibugs>	 (PS4) Effie Mouzeli: Fix coal syslog logging [puppet] - https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: Gilles)
[11:05:06] <wikibugs>	 Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) I've tracked down the root cause of the issue: https://github.com/dpkp/kafka-python/issues/1774  For other uses of python-kafka we have, we simp...
[11:05:22] <Lucas_WMDE>	 let’s see if any other config patches got successfully merged recently
[11:05:28] <wikibugs>	 (CR) Muehlenhoff: "| tbh i think for notebook and possibly the stat servers it may be easier to exclude everything unless explicitly requested?" [puppet] - https://gerrit.wikimedia.org/r/507056 (owner: Jbond)
[11:07:05] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "let’s retry that" [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:07:07] <wikibugs>	 (Merged) jenkins-bot: prometheus: fix base URL template [software/spicerack] - https://gerrit.wikimedia.org/r/507279 (owner: Volans)
[11:07:09] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:08:02] <wikibugs>	 (CR) jenkins-bot: prometheus: fix base URL template [software/spicerack] - https://gerrit.wikimedia.org/r/507279 (owner: Volans)
[11:08:27] <wikibugs>	 (Merged) jenkins-bot: Serialize empty lists as objects on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:08:31] <logmsgbot>	 !log gilles@deploy1001 Started deploy [performance/navtiming@d6756c0]: T221848 Proper fix for partitions_for_topic in python-kafka > 1.4.4
[11:08:37] <logmsgbot>	 !log gilles@deploy1001 Finished deploy [performance/navtiming@d6756c0]: T221848 Proper fix for partitions_for_topic in python-kafka > 1.4.4 (duration: 00m 05s)
[11:08:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:37] <stashbot>	 T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[11:08:41] <wikibugs>	 (CR) jenkins-bot: Serialize empty lists as objects on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:08:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:45] <Lucas_WMDE>	 okay looks like it worked this time
[11:08:52] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Allow cross-site requests from mobile domains (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:09:24] <Lucas_WMDE>	 Lucas_WMDE: your first patch is on mwdebug1002, please test
[11:10:14] <Lucas_WMDE>	 working as expected, deploying
[11:12:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:507031|Serialize empty lists as objects on Wikidata (T138104)]] (duration: 00m 55s)
[11:12:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:04] <stashbot>	 T138104: Do not serialize empty containers (descriptions/aliases/sitelinks) as empty array [] - https://phabricator.wikimedia.org/T138104
[11:12:24] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:13:05] <wikibugs>	 Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles)
[11:13:28] <wikibugs>	 (Merged) jenkins-bot: Serialize empty lists as objects on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:13:58] <wikibugs>	 Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) @Ottomata all our services are good now, you can go ahead with upgrading EventLogging and Hadoop.
[11:14:01] <Lucas_WMDE>	 Lucas_WMDE: your second patch is on mwdebug1002, please test
[11:14:11] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish
[11:14:31] <Lucas_WMDE>	 also working as expected, deploying
[11:15:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:507032|Serialize empty lists as objects on Commons (T138104)]] (duration: 00m 54s)
[11:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:56] <Lucas_WMDE>	 okay, I’m done
[11:16:04] <Lucas_WMDE>	 matthiasmullie: can you deploy your own change?
[11:16:15] <matthiasmullie>	 yeah sure
[11:16:18] <Lucas_WMDE>	 ok
[11:16:23] <matthiasmullie>	 thanks
[11:16:56] <wikibugs>	 (PS4) Matthias Mullie: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734)
[11:17:06] <wikibugs>	 Operations, Tools, cloud-services-team (Kanban): Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (hashar) On a freshly created instance that causes apt to fail and causes the puppet-agent-cronjob to fail: ` Apr 30 06:15:02 integration-slave-docker-...
[11:18:22] <wikibugs>	 (CR) Matthias Mullie: [C: +2] Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:18:24] <wikibugs>	 (PS14) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[11:18:37] <wikibugs>	 Operations, Puppet, Icinga, monitoring, Patch-For-Review: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (ema)
[11:18:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[11:19:20] <wikibugs>	 (Merged) jenkins-bot: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:20:07] <wikibugs>	 (CR) jenkins-bot: Serialize empty lists as objects on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:20:09] <wikibugs>	 (CR) jenkins-bot: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:20:26] <wikibugs>	 Operations, Puppet, Packaging, Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond) >>! In T219803#5139829, @MoritzMuehlenhoff wrote: > One thing that will need to be fixed is the detection of HP machines to install 'hp-health' in module...
[11:21:54] <wikibugs>	 (PS15) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[11:21:59] <wikibugs>	 Operations, Puppet: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (jbond) p:Triage→Normal
[11:22:28] <logmsgbot>	 !log mlitn@deploy1001 Synchronized wmf-config/CommonSettings.php: Allow cross-site requests from mobile domains (duration: 00m 52s)
[11:22:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[11:24:19] <wikibugs>	 (PS1) Ema: cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291
[11:24:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291 (owner: Ema)
[11:25:12] <wikibugs>	 Operations, Puppet, Icinga, monitoring, Patch-For-Review: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (Volans) After a bit of digging with @ema we found that in this case the `/var/lib/puppet/state/last_run_summary.yaml` file is...
[11:25:55] <wikibugs>	 (PS2) Ema: cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291
[11:26:38] <matthiasmullie>	 I'm done
[11:26:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291 (owner: Ema)
[11:26:57] <Lucas_WMDE>	 me too, and nothing else in the deployment calendar, so
[11:27:00] <Lucas_WMDE>	 !log EU SWAT done
[11:27:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:45] <wikibugs>	 (PS3) Ema: cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291
[11:29:50] <wikibugs>	 (CR) Ema: [C: +2] cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291 (owner: Ema)
[11:35:22] <wikibugs>	 Operations, Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (hashar)
[11:36:39] <wikibugs>	 (PS3) Ema: cache: reimage cp4022 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967)
[11:40:43] <wikibugs>	 (PS11) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[11:41:04] <wikibugs>	 (CR) Mathew.onipe: icinga: create and apply cirrus config check (10 comments) [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[11:44:40] <wikibugs>	 (PS1) Elukey: Deprecate GELF logging [puppet/cdh] - https://gerrit.wikimedia.org/r/507297
[11:48:05] <wikibugs>	 (CR) Gilles: "@ema anything else you need me to do on this patch?" [puppet] - https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: Gilles)
[11:53:23] <wikibugs>	 (PS16) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[11:58:54] <icinga-wm>	 PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1200)
[12:00:10] <wikibugs>	 (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803)
[12:00:12] <wikibugs>	 (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 esams [puppet] - https://gerrit.wikimedia.org/r/507300 (https://phabricator.wikimedia.org/T219803)
[12:00:14] <wikibugs>	 (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 ulsfo [puppet] - https://gerrit.wikimedia.org/r/507301 (https://phabricator.wikimedia.org/T219803)
[12:00:18] <wikibugs>	 (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 codfw [puppet] - https://gerrit.wikimedia.org/r/507302 (https://phabricator.wikimedia.org/T219803)
[12:00:20] <wikibugs>	 (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 eqiad [puppet] - https://gerrit.wikimedia.org/r/507303 (https://phabricator.wikimedia.org/T219803)
[12:00:23] <wikibugs>	 (PS1) Jbond: facter3/puppet5: change default versions [puppet] - https://gerrit.wikimedia.org/r/507304
[12:00:25] <wikibugs>	 (PS1) Jbond: facter3/puppet5: clean up old config [puppet] - https://gerrit.wikimedia.org/r/507305 (https://phabricator.wikimedia.org/T219803)
[12:02:37] <revi>	 who should I contact to get https://gerrit.wikimedia.org/r/c/operations/puppet/+/506895 merged?
[12:02:40] <wikibugs>	 (PS15) Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[12:02:52] <wikibugs>	 (PS13) Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203)
[12:02:54] <wikibugs>	 (PS4) Jcrespo: mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203)
[12:03:32] <wikibugs>	 (PS1) Elukey: Enable hdfs-audit log [puppet/cdh] - https://gerrit.wikimedia.org/r/507306
[12:03:34] <wikibugs>	 Operations, Performance-Team, Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (Gilles) p:Normal→Low
[12:03:51] <wikibugs>	 (CR) Jcrespo: [C: -2] "Pending complete setup of db2100." [puppet] - https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[12:07:22] <wikibugs>	 (CR) Joal: "One comment on comment :)" (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/507306 (owner: Elukey)
[12:11:10] <wikibugs>	 (PS1) Gilles: Proxy Thumbor 404s as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071)
[12:11:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] Proxy Thumbor 404s as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: Gilles)
[12:14:37] <wikibugs>	 (PS2) Gilles: Proxy Thumbor 404s as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071)
[12:19:23] <wikibugs>	 (PS2) Elukey: Enable hdfs-audit log [puppet/cdh] - https://gerrit.wikimedia.org/r/507306
[12:21:22] <wikibugs>	 (CR) Joal: [C: +1] "LGTM!" [puppet/cdh] - https://gerrit.wikimedia.org/r/507306 (owner: Elukey)
[12:25:20] <icinga-wm>	 RECOVERY - puppet last run on mw1348 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:26:55] <wikibugs>	 (PS2) CDanis: prometheus: 10M max-samples for all instances [puppet] - https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105)
[12:30:21] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] profile::mediawiki::nutcracker: make memcached configuration optional [puppet] - https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[12:32:41] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'R:prometheus::server' 'disable-puppet "staged rollout T222105 by cdanis"'
[12:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:46] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[12:32:53] <wikibugs>	 (CR) CDanis: [C: +2] prometheus: 10M max-samples for all instances [puppet] - https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105) (owner: CDanis)
[12:34:52] <elukey>	 !log moved /home to /srv/home (more space in a dedicated partition) on stat1005
[12:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:59] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[12:36:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[12:37:25] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16201/" [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[12:37:34] <wikibugs>	 (PS17) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[12:39:05] <cdanis>	 !log cdanis@prometheus1004.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"
[12:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:09] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[12:41:17] <arturo>	 !log merging a sudo puppet module change
[12:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:05] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/506750 (owner: Dzahn)
[12:46:14] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:47:01] <cdanis>	 !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"
[12:47:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:05] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[12:47:30] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:48:24] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[12:51:31] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/507095 (owner: Alex Monk)
[12:53:18] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:54:38] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:55:12] <wikibugs>	 (CR) Ottomata: [C: +1] ":)" [puppet/cdh] - https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[12:55:41] <wikibugs>	 (CR) Ottomata: [C: +1] Deprecate GELF logging [puppet/cdh] - https://gerrit.wikimedia.org/r/507297 (owner: Elukey)
[12:56:04] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[12:58:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:59:33] <wikibugs>	 (CR) Volans: [C: +2] doc: autodoc missing API modules [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[13:00:05] <jouncebot>	 Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1300)
[13:02:51] <cdanis>	 !log OOMed the eqiad ops prometheus @ prometheus1004
[13:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:10] <icinga-wm>	 PROBLEM - configured eth on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:03:16] <icinga-wm>	 PROBLEM - Check size of conntrack table on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:03:32] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:03:38] <icinga-wm>	 PROBLEM - Disk space on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[13:04:00] <icinga-wm>	 PROBLEM - DPKG on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:04:01] <wikibugs>	 (Merged) jenkins-bot: doc: autodoc missing API modules [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[13:04:06] <icinga-wm>	 PROBLEM - Check systemd state on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:04:20] <icinga-wm>	 PROBLEM - dhclient process on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:04:54] <wikibugs>	 (CR) jenkins-bot: doc: autodoc missing API modules [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[13:06:04] <wikibugs>	 (PS1) Ottomata: Add cumin aliasaes for schema* [puppet] - https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556)
[13:07:00] <wikibugs>	 Operations, Analytics, Analytics-Kanban, EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata) Done, thanks.  Also edited https://wikitech.wikimedia.org/wiki/Ganeti#Assign_a_hostname%2FIP with instructions for futur...
[13:07:50] <icinga-wm>	 PROBLEM - puppet last run on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:08:03] <cdanis>	 !log OOMed the eqiad ops prometheus @ prometheus1003
[13:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:08] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:14:36] <icinga-wm>	 PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:15:09] <cdanis>	 !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo disable-puppet 'cdanis testing original query.max-samples T222105'
[13:15:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:14] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[13:15:48] <wikibugs>	 (PS1) Ema: varnish: add Vagrantfile to run VTC tests [puppet] - https://gerrit.wikimedia.org/r/507316
[13:15:50] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: Revert "sudo: decouple sudo from sudo-ldap" [puppet] - https://gerrit.wikimedia.org/r/507317 (https://phabricator.wikimedia.org/T221225)
[13:16:29] <cdanis>	 !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo systemctl restart prometheus@ops.service
[13:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:42] <icinga-wm>	 RECOVERY - Disk space on prometheus1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[13:17:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "sudo: decouple sudo from sudo-ldap" [puppet] - https://gerrit.wikimedia.org/r/507317 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[13:17:06] <icinga-wm>	 RECOVERY - DPKG on prometheus1004 is OK: All packages OK
[13:17:10] <icinga-wm>	 RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational
[13:17:24] <icinga-wm>	 RECOVERY - dhclient process on prometheus1004 is OK: PROCS OK: 0 processes with command name dhclient
[13:17:34] <icinga-wm>	 RECOVERY - configured eth on prometheus1004 is OK: OK - interfaces up
[13:17:38] <icinga-wm>	 RECOVERY - Check size of conntrack table on prometheus1004 is OK: OK: nf_conntrack is 3 % full
[13:17:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "Ignoring jenkins-bot, the linter is complaining about stuff that was fixed in my patch attempt that I'm now reverting." [puppet] - 10https://gerrit.wikimedia.org/r/507317 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez)
[13:17:56] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on prometheus1004 is OK: OK ferm input default policy is set
[13:18:28] <icinga-wm>	 RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:20:38] <wikibugs>	 (03CR) 10Gilles: [C: 03+1] "Works perfectly, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/507316 (owner: 10Ema)
[13:20:40] <arturo>	 !log reverting sudo puppet module changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/507317
[13:20:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:17] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "Looks reasonable to me. Commit message nit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: 10Gilles)
[13:23:37] <wikibugs>	 (03PS2) 10Ema: varnish: add Vagrantfile to run VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/507316
[13:24:18] <wikibugs>	 (03CR) 10Ema: [C: 03+2] varnish: add Vagrantfile to run VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/507316 (owner: 10Ema)
[13:24:23] <wikibugs>	 10Operations, 10Cassandra, 10Core Platform Team Kanban (Done with CPT), 10Services (done), 10User-Eevans: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10MoritzMuehlenhoff) The version of c-foreach-restart as currently deployed on restbase* doesn't seem to us...
[13:24:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM added a minor comment" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507281 (owner: 10Volans)
[13:25:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "This change was reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/507317" [puppet] - 10https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez)
[13:28:19] <ema>	 !log depool cp4022 and reimage as upload_ats T219967
[13:28:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:23] <stashbot>	 T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[13:29:53] <cdanis>	 !log cdanis@prometheus1004.eqiad.wmnet ~ % sudo systemctl restart prometheus@ops.service
[13:29:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:03] <wikibugs>	 (03PS4) 10Ema: cache: reimage cp4022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967)
[13:32:29] <wikibugs>	 (03CR) 10Ema: [C: 03+2] cache: reimage cp4022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[13:35:04] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4022.ulsfo.wmnet'] ` The log can be...
[13:35:16] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[13:40:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - 10https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[13:40:15] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Ottomata)
[13:40:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: enable puppet5/facter3 esams [puppet] - 10https://gerrit.wikimedia.org/r/507300 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[13:40:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: enable puppet5/facter3 ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/507301 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[13:40:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: enable puppet5/facter3 codfw [puppet] - 10https://gerrit.wikimedia.org/r/507302 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[13:41:01] <wikibugs>	 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Ottomata)
[13:41:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: enable puppet5/facter3 eqiad [puppet] - 10https://gerrit.wikimedia.org/r/507303 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[13:42:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: change default versions [puppet] - 10https://gerrit.wikimedia.org/r/507304 (owner: 10Jbond)
[13:43:56] <wikibugs>	 (03PS3) 10Gilles: Proxy Thumbor errors as-is [puppet] - 10https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071)
[13:44:04] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "One nit but +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[13:44:08] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on prometheus1004 is OK: OK: synced at Tue 2019-04-30 13:44:07 UTC.
[13:44:09] <wikibugs>	 (03CR) 10Gilles: Proxy Thumbor errors as-is (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: 10Gilles)
[13:44:40] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] kafkatee: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn)
[13:45:48] <icinga-wm>	 RECOVERY - puppet last run on mw1321 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:46:48] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Enable hdfs-audit log [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507306 (owner: 10Elukey)
[13:48:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506171 (owner: 10Muehlenhoff)
[13:48:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: 10Gilles)
[13:51:08] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10fgiunchedi) Indeed, we've ran into the same problem on {T219764}. tl;dr the solution is to repeat the install: `apt install rsyslog rsyslo...
[13:51:48] <wikibugs>	 (03PS6) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987)
[13:52:25] <wikibugs>	 (03PS4) 10Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987)
[13:52:50] <wikibugs>	 (03PS5) 10Jbond: logstash: add ulog parser to logstash [puppet] - 10https://gerrit.wikimedia.org/r/506400 (https://phabricator.wikimedia.org/T220987)
[13:55:13] <wikibugs>	 (03PS1) 10BBlack: Revert "Add CNAME-variant langlist template" [dns] - 10https://gerrit.wikimedia.org/r/507321 (https://phabricator.wikimedia.org/T208263)
[13:55:16] <wikibugs>	 (03PS1) 10Volans: doc: mark Sphinx warnings as error [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322
[13:55:52] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Revert "Add CNAME-variant langlist template" [dns] - 10https://gerrit.wikimedia.org/r/507321 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack)
[13:55:54] <wikibugs>	 (03CR) 10Volans: [C: 03+2] doc: add checker to ensure modules are documented [software/spicerack] - 10https://gerrit.wikimedia.org/r/507281 (owner: 10Volans)
[13:57:28] <wikibugs>	 (03PS4) 10BBlack: wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610
[13:57:28] <wikibugs>	 (03PS4) 10BBlack: wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611
[13:57:30] <wikibugs>	 (03PS4) 10BBlack: wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612
[13:57:32] <wikibugs>	 (03PS4) 10BBlack: wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613
[13:57:44] <wikibugs>	 (03PS2) 10BBlack: wm.org no-op cleanup: move other meta up from end [dns] - 10https://gerrit.wikimedia.org/r/507093
[13:57:45] <wikibugs>	 (03PS4) 10BBlack: ns[012].wikimedia.org: 1D TTLs [dns] - 10https://gerrit.wikimedia.org/r/501614
[13:57:47] <wikibugs>	 (03PS5) 10BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615
[13:59:11] <wikibugs>	 (03Abandoned) 10Ema: varnishtest: mock VCL configuration [puppet] - 10https://gerrit.wikimedia.org/r/340511 (owner: 10Ema)
[14:01:09] <wikibugs>	 (03Merged) 10jenkins-bot: doc: add checker to ensure modules are documented [software/spicerack] - 10https://gerrit.wikimedia.org/r/507281 (owner: 10Volans)
[14:02:06] <wikibugs>	 (03CR) 10jenkins-bot: doc: add checker to ensure modules are documented [software/spicerack] - 10https://gerrit.wikimedia.org/r/507281 (owner: 10Volans)
[14:04:12] <wikibugs>	 (03PS1) 10Ema: Revert "Revert "package_builder: move lintian out of require_package"" [puppet] - 10https://gerrit.wikimedia.org/r/507324 (https://phabricator.wikimedia.org/T221784)
[14:05:27] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wm.org no-op cleanup: no empty left-hand-side [dns] - 10https://gerrit.wikimedia.org/r/501610 (owner: 10BBlack)
[14:05:32] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/501611 (owner: 10BBlack)
[14:05:35] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wm.org no-op cleanup: Group on hostnames for meta [dns] - 10https://gerrit.wikimedia.org/r/501612 (owner: 10BBlack)
[14:05:38] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wm.org no-op cleanup: re-arrange top section [dns] - 10https://gerrit.wikimedia.org/r/501613 (owner: 10BBlack)
[14:05:43] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wm.org no-op cleanup: move other meta up from end [dns] - 10https://gerrit.wikimedia.org/r/507093 (owner: 10BBlack)
[14:09:41] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4022.ulsfo.wmnet'] `  Of which those **FAILED**: ` ['cp4022.ulsfo.wmnet'] `
[14:09:56] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] ns[012].wikimedia.org: 1D TTLs [dns] - 10https://gerrit.wikimedia.org/r/501614 (owner: 10BBlack)
[14:11:46] <icinga-wm>	 PROBLEM - Long running screen/tmux on ganeti2003 is CRITICAL: CRIT: Long running tmux process. (user: fsero PID: 3700, 1739799s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[14:11:53] <wikibugs>	 10Operations, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox)
[14:12:11] <wikibugs>	 10Operations, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox)
[14:12:33] <wikibugs>	 (03CR) 10Volans: icinga: create and apply cirrus config check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[14:13:49] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Support upgrades which introduce changes to binary package names (client side) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506171 (owner: 10Muehlenhoff)
[14:14:56] <wikibugs>	 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff)
[14:15:16] <cdanis>	 !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo enable-puppet 'cdanis testing original query.max-samples T222105'
[14:15:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:21] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[14:16:30] <wikibugs>	 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff)
[14:16:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans)
[14:17:04] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2003*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'
[14:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:24] <wikibugs>	 (03PS1) 10Ottomata: eventschemas::service - gzip minified static css/js files [puppet] - 10https://gerrit.wikimedia.org/r/507326 (https://phabricator.wikimedia.org/T219552)
[14:18:34] <wikibugs>	 (03CR) 10Volans: [C: 03+2] doc: mark Sphinx warnings as error [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans)
[14:19:12] <bblack>	 looks like we had more artificial prometheus dropouts on eqiad DNS data in the past couple of hours
[14:19:15] <bblack>	 https://grafana.wikimedia.org/d/000000341/dns?orgId=1&from=now-3h&to=now
[14:19:16] <wikibugs>	 (03PS2) 10Ottomata: eventschemas::service - gzip minified static css/js files [puppet] - 10https://gerrit.wikimedia.org/r/507326 (https://phabricator.wikimedia.org/T219552)
[14:20:05] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans)
[14:20:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Drop trusty from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/507327
[14:20:44] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventschemas::service - gzip minified static css/js files [puppet] - 10https://gerrit.wikimedia.org/r/507326 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[14:21:58] <wikibugs>	 10Operations, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (10Mathew.onipe) @Lucas_Werkmeister_WMDE Thanks! will keep that mind. Your reviews will be welcome when I submit a patch too.
[14:22:23] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (10Mathew.onipe) p:05Triage→03Normal
[14:22:37] <wikibugs>	 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10CDanis) I'm pretty sure it is these panels that are responsible for the most Prometheus load {F28868610} They take much longer to load than the res...
[14:22:38] <wikibugs>	 (03Merged) 10jenkins-bot: doc: mark Sphinx warnings as error [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans)
[14:23:31] <wikibugs>	 (03CR) 10jenkins-bot: doc: mark Sphinx warnings as error [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans)
[14:23:49] <wikibugs>	 (03PS3) 10Volans: cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955
[14:24:30] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2004*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'
[14:24:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:34] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[14:26:33] <jbond42>	 !log disable-puppet "T220987: global kafaka log shipping - staged rollout (jbond)"
[14:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:39] <stashbot>	 T220987: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987
[14:28:30] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 (owner: 10BBlack)
[14:31:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond)
[14:31:23] <wikibugs>	 (03PS7) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987)
[14:31:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond)
[14:32:26] <wikibugs>	 (03PS1) 10BBlack: CAA: Add LE to issuewild for policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/507330
[14:32:57] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] CAA: Add LE to issuewild for policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/507330 (owner: 10BBlack)
[14:33:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] "I've updated the check_conntrack run book a bit." [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[14:34:17] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16212/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507306 (owner: 10Elukey)
[14:34:19] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Deprecate GELF logging [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507297 (owner: 10Elukey)
[14:34:31] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Enable hdfs-audit log [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507306 (owner: 10Elukey)
[14:35:31] <wikibugs>	 (03CR) 10Mathew.onipe: Add postgres slave init cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe)
[14:36:17] <wikibugs>	 (03PS1) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333
[14:38:51] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans)
[14:39:40] <wikibugs>	 (03PS6) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[14:41:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe)
[14:42:18] <wikibugs>	 (03PS7) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[14:43:25] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'bast5001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'
[14:43:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:29] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[14:43:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe)
[14:44:04] <wikibugs>	 (03Merged) 10jenkins-bot: cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans)
[14:44:15] <jijiki>	 !log Sending 1% of anonymous users to PHP7.2 - T219150
[14:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:19] <stashbot>	 T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150
[14:44:35] <wikibugs>	 (03PS2) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702)
[14:44:59] <wikibugs>	 (03CR) 10jenkins-bot: cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans)
[14:46:33] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Send 1% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506953 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto)
[14:48:37] <wikibugs>	 (03PS1) 10BBlack: wm.org: add IN where missing on DYNAs [dns] - 10https://gerrit.wikimedia.org/r/507337
[14:49:18] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'labmon1001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"'
[14:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:23] <stashbot>	 T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[14:49:51] <wikibugs>	 (03CR) 10Muehlenhoff: "I like the approach, some comments inline" (038 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 10Jbond)
[14:50:21] <wikibugs>	 (03PS1) 10Ottomata: eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552)
[14:51:29] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wm.org: add IN where missing on DYNAs [dns] - 10https://gerrit.wikimedia.org/r/507337 (owner: 10BBlack)
[14:52:04] <wikibugs>	 (03PS1) 10Volans: monitoring: detect Puppet dependency cycle failure [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784)
[14:53:59] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[14:54:45] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs: Remove puppet code for the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293)
[14:54:48] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "Looks great, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784) (owner: 10Volans)
[14:55:05] <wikibugs>	 (03PS2) 10Ottomata: eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552)
[14:55:41] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[14:55:48] <wikibugs>	 (03CR) 10Andrew Bogott: "There's no rush to merge this but I'm running it through the puppet compiler to see what we've missed." [puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293) (owner: 10Andrew Bogott)
[14:56:01] <cdanis>	 !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'bast3002*' 'run-puppet-agent --enable "filippo prometheus"'
[14:56:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:29] <wikibugs>	 (03PS3) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702)
[14:56:42] <wikibugs>	 (03PS1) 10Ottomata: eventschemas::service - Don't render config.js.erb, it is in files/ now [puppet] - 10https://gerrit.wikimedia.org/r/507341 (https://phabricator.wikimedia.org/T219552)
[14:57:17] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventschemas::service - Don't render config.js.erb, it is in files/ now [puppet] - 10https://gerrit.wikimedia.org/r/507341 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[14:58:42] <jbond42>	 !log enable-puppet "T220987: global kafaka log shipping - staged rollout (jbond)"
[14:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:45] <stashbot>	 T220987: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987
[15:00:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16213/" [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey)
[15:00:52] <wikibugs>	 (03PS4) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702)
[15:01:59] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Send 1% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506953 (https://phabricator.wikimedia.org/T219150)
[15:03:57] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add fake ssh keys for netbox network user [labs/private] - 10https://gerrit.wikimedia.org/r/507231 (owner: 10Ayounsi)
[15:06:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond)
[15:06:34] <wikibugs>	 (03PS5) 10Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987)
[15:08:05] <_joe_>	 jijiki: are you deploying?
[15:08:09] <jijiki>	 yes
[15:08:11] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Wikimedia-Incident: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105 (10CDanis) 05Open→03Resolved a:03CDanis As documented in T222112#5147131 this didn't actually fix the dashboard at fau...
[15:08:13] <jijiki>	 about to
[15:08:13] <_joe_>	 ok
[15:08:21] <_joe_>	 yeah let's please move on
[15:08:38] <_joe_>	 jenkins made us waste 20 minutes
[15:09:23] <wikibugs>	 (03CR) 10jenkins-bot: Send 1% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506953 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto)
[15:09:58] <logmsgbot>	 !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Send 1% of anonymous users to PHP7.2 - T219150 (duration: 00m 54s)
[15:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:03] <stashbot>	 T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150
[15:11:00] <wikibugs>	 (03PS2) 10Volans: monitoring: detect Puppet dependency cycle failure [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784)
[15:12:06] <wikibugs>	 (03CR) 10Volans: [C: 03+2] monitoring: detect Puppet dependency cycle failure [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784) (owner: 10Volans)
[15:13:54] <wikibugs>	 (03PS1) 10Herron: exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345
[15:16:00] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/507345 (owner: 10Herron)
[15:17:25] <wikibugs>	 (03PS2) 10Ema: Revert "Revert "package_builder: move lintian out of require_package"" [puppet] - 10https://gerrit.wikimedia.org/r/507324 (https://phabricator.wikimedia.org/T221784)
[15:18:19] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Revert "Revert "package_builder: move lintian out of require_package"" [puppet] - 10https://gerrit.wikimedia.org/r/507324 (https://phabricator.wikimedia.org/T221784) (owner: 10Ema)
[15:18:27] <wikibugs>	 (03PS2) 10Herron: exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345
[15:18:29] <wikibugs>	 (03PS1) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832)
[15:18:39] <jynus>	 !log stop s8 instance on dbstore2001 for cloning to db2100 T220572
[15:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:43] <stashbot>	 T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572
[15:20:44] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (10cwdent)
[15:20:48] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10cwdent) 05Open→03Resolved @Papaul thanks!  Working :)
[15:20:52] <wikibugs>	 (03PS3) 10Herron: exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345
[15:21:40] <wikibugs>	 (03CR) 10Herron: [C: 03+2] exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345 (owner: 10Herron)
[15:24:44] <wikibugs>	 (03PS2) 10Herron: remove granularity key from wiki-mail DKIM [dns] - 10https://gerrit.wikimedia.org/r/504948 (https://phabricator.wikimedia.org/T221290) (owner: 10Cwhite)
[15:26:46] <wikibugs>	 (03PS8) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946)
[15:28:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - 10https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond)
[15:28:12] <wikibugs>	 (03PS2) 10Jbond: facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - 10https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803)
[15:31:33] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi)
[15:31:48] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi)
[15:32:59] <icinga-wm>	 PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[15:33:03] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF)
[15:33:23] <wikibugs>	 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF)
[15:33:39] <wikibugs>	 (03PS6) 10Paladox: scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844)
[15:35:22] <jbond42>	 looking at labstore1003 now
[15:41:30] <wikibugs>	 (03PS1) 10Jbond: kafka: disable kafka on trusty [puppet] - 10https://gerrit.wikimedia.org/r/507350
[15:41:44] <wikibugs>	 (03PS1) 10BryanDavis: Revert "striker: Disable developer account creation" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830)
[15:41:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "striker: Disable developer account creation" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) (owner: 10BryanDavis)
[15:43:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] kafka: disable kafka on trusty [puppet] - 10https://gerrit.wikimedia.org/r/507350 (owner: 10Jbond)
[15:43:57] <wikibugs>	 (03PS2) 10BryanDavis: Revert "striker: Disable developer account creation" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830)
[15:45:36] <wikibugs>	 (03CR) 10CRusnov: "Rambles inline." (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans)
[15:45:36] <elukey>	 !log restart hadoop hdfs namenodes on an-master100[1,2] to pick up new logging settings - T220702
[15:45:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:43] <stashbot>	 T220702: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702
[15:50:09] <wikibugs>	 (03CR) 10Alex Monk: "I think that apache user was actually in use on precise." [puppet] - 10https://gerrit.wikimedia.org/r/506750 (owner: 10Dzahn)
[15:50:40] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) Labmon hosts are in active/standby pairs for graphite whereas prometheus runs on both independently but only labmon1001 is used for queries via `pro...
[15:50:41] <wikibugs>	 (03CR) 10CRusnov: [C: 04-1] "Looks good, minor nit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi)
[15:51:09] <wikibugs_>	 (03CR) 10Alex Monk: "T78076 was related" [puppet] - 10https://gerrit.wikimedia.org/r/506750 (owner: 10Dzahn)
[15:54:48] <wikibugs_>	 10Operations, 10decommission: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857 (10RobH) p:05Triage→03Normal
[15:55:09] <wikibugs_>	 (03PS1) 10Ottomata: Include profile::standard and base::firewall in role::eventschemas::service [puppet] - 10https://gerrit.wikimedia.org/r/507353 (https://phabricator.wikimedia.org/T219556)
[15:55:18] <wikibugs_>	 10Operations, 10decommission: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857 (10RobH)
[15:55:48] <wikibugs>	 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH)
[15:56:10] <wikibugs>	 (03PS2) 10Ottomata: Include profile::standard and base::firewall in role::eventschemas::service [puppet] - 10https://gerrit.wikimedia.org/r/507353 (https://phabricator.wikimedia.org/T219556)
[15:57:09] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Include profile::standard and base::firewall in role::eventschemas::service [puppet] - 10https://gerrit.wikimedia.org/r/507353 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata)
[15:57:15] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] wikitech: Provision gerrit api auth credentials [puppet] - 10https://gerrit.wikimedia.org/r/506588 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[15:57:23] <wikibugs>	 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH) network info:  labservices1001 : asw2-d-eqiad:ge-3/0/9   labservices1002 :  asw2-a-eqiad:ge-4/0/12
[15:58:32] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[15:58:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:35] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:58:38] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[15:58:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:43] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.decommission
[15:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:47] <wikibugs>	 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labservices1001.wikimedia.org` -  labservices1001.wikimedia.org   - Removed from Puppet master...
[15:58:49] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[15:58:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:55] <wikibugs>	 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labservices1002.wikimedia.org` -  labservices1002.wikimedia.org   - Removed from Puppet master...
[16:00:04] <jouncebot>	 godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1600).
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:16] <wikibugs>	 (03PS1) 10RobH: decom labservices100[12] references [puppet] - 10https://gerrit.wikimedia.org/r/507354 (https://phabricator.wikimedia.org/T221857)
[16:01:31] <wikibugs>	 (03CR) 10RobH: [C: 03+2] decom labservices100[12] references [puppet] - 10https://gerrit.wikimedia.org/r/507354 (https://phabricator.wikimedia.org/T221857) (owner: 10RobH)
[16:01:39] <wikibugs>	 (03PS2) 10RobH: decom labservices100[12] references [puppet] - 10https://gerrit.wikimedia.org/r/507354 (https://phabricator.wikimedia.org/T221857)
[16:03:15] <wikibugs>	 (03PS2) 10RobH: decom labnet100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/506574 (https://phabricator.wikimedia.org/T221818)
[16:03:54] <wikibugs>	 (03CR) 10RobH: [C: 03+2] decom labnet100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/506574 (https://phabricator.wikimedia.org/T221818) (owner: 10RobH)
[16:04:50] <ema>	 !log pool cp4022 w/ ATS backend T219967
[16:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:54] <stashbot>	 T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo   - https://phabricator.wikimedia.org/T219967
[16:06:30] <wikibugs>	 (03PS1) 10RobH: decom labservices100[12] prod dns [dns] - 10https://gerrit.wikimedia.org/r/507356 (https://phabricator.wikimedia.org/T221857)
[16:06:58] <wikibugs>	 (03CR) 10RobH: [C: 03+2] decom labservices100[12] prod dns [dns] - 10https://gerrit.wikimedia.org/r/507356 (https://phabricator.wikimedia.org/T221857) (owner: 10RobH)
[16:08:01] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[16:08:13] <wikibugs>	 10Operations, 10Performance-Team, 10Thumbor, 10Traffic, 10Patch-For-Review: SwiftMedia URL rewrite returns some 404s with wrong Content-Length - https://phabricator.wikimedia.org/T222071 (10fgiunchedi) I'm wondering if the underlying issue here (copying responses inside `rewrite.py`) could be the culprit...
[16:08:37] <wikibugs>	 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10thcipriani)
[16:09:19] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[16:09:34] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Minor comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi)
[16:09:37] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357
[16:09:54] <wikibugs>	 10Operations, 10decommission, 10Patch-For-Review: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH)
[16:10:16] <wikibugs>	 10Operations, 10ops-eqiad, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH) a:05RobH→03Cmjohnson
[16:11:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500413 (owner: 10Muehlenhoff)
[16:12:05] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (10thcipriani)
[16:12:08] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10thcipriani)
[16:12:27] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10aborrero) >>! In T187987#5147514, @fgiunchedi wrote: > Labmon hosts are in active/standby pairs for graphite whereas prometheus runs on both independently but o...
[16:12:40] <wikibugs>	 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox)
[16:12:47] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10thcipria...
[16:12:51] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10thcipriani)
[16:14:23] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10thcipriani)
[16:16:12] <wikibugs>	 10Operations, 10ORES, 10Release Pipeline, 10Scoring-platform-team, and 2 others: Execution of the deployment pipeline should be configurable via .pipeline/config.yaml - https://phabricator.wikimedia.org/T210267 (10thcipriani)
[16:18:29] <wikibugs>	 (03PS1) 10Ema: cache: add hiera setting for varnish backend restarts [puppet] - 10https://gerrit.wikimedia.org/r/507358 (https://phabricator.wikimedia.org/T219967)
[16:18:47] <wikibugs>	 (03PS1) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359
[16:19:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov)
[16:19:46] <volans>	 you need to specify it's a dir
[16:19:54] <wikibugs>	 10Operations, 10Mail: Gmail - Multiple destination domains per transaction is unsupported. Please try again. - https://phabricator.wikimedia.org/T222198 (10herron) p:05Triage→03Normal
[16:19:56] <volans>	 oops wrong chan
[16:20:46] <wikibugs>	 (03PS2) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359
[16:26:08] <wikibugs>	 (03PS19) 10Volans: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[16:26:41] <wikibugs>	 (03PS3) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359
[16:26:52] <wikibugs>	 10Operations, 10observability, 10Wikimedia-Incident: prometheus: upgrade to 2.9.2 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) +1 to testing/PoC 2.9.2; we're using Debian Prometheus packages mostly verbatim, but adding back the k8s discovery + dependencies back as they are not shipped in Debian...
[16:26:55] <jbond42>	 !log upgrade puppet and facter in eqsin
[16:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:07] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[16:27:26] <wikibugs>	 (03CR) 10Ema: [C: 03+2] cache: add hiera setting for varnish backend restarts [puppet] - 10https://gerrit.wikimedia.org/r/507358 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[16:27:40] <volans>	 ottomata: missing eventschemas_codfw in icinga config ^^^
[16:27:52] <XioNoX>	 !log upgrade librenms to 1.51
[16:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:17] <ottomata>	 looking
[16:28:29] <volans>	 ottomata: it's the usual hieradata/common/monitoring.yaml
[16:28:41] <ottomata>	 not sure i know this 'usual' :p
[16:29:14] <volans>	 any new value of "$cluster_$dc" must be defined there
[16:29:34] <volans>	 or icinga complains
[16:29:43] <ottomata>	 hm....
[16:30:16] <wikibugs>	 10Operations, 10Core Platform Team Kanban (Blocked Externally), 10Services (blocked), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10fgiunchedi) >>! In T178839#5144895, @mobrovac wrote: > @Eevans @fgiunchedi is there a plan to resume this work or s...
[16:30:18] <volans>	 cluster groups are defined as $cluster_$dc
[16:31:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[16:32:52] <wikibugs>	 (03PS1) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552)
[16:33:56] <volans>	 ottomata: that's not for the svc. stuff, but for the hosts
[16:34:01] <volans>	 the schemaNNNN hosts
[16:34:08] <volans>	 they have
[16:34:08] <volans>	 hostgroups                     eventschemas_eqiad
[16:34:11] <volans>	 in icinga config
[16:34:30] <volans>	 just FYI, the comments seem a bit confusing
[16:34:44] <wikibugs>	 (03PS1) 10Herron: mx: disable multi_domain in smtp transports [puppet] - 10https://gerrit.wikimedia.org/r/507365 (https://phabricator.wikimedia.org/T222198)
[16:36:20] <wikibugs>	 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10fgiunchedi) FYI the upgrade seems to be generating cronspam, in the form of facter warnings:  `lines=5 Subject: Cron <root@cp5001> /usr/local/sbin/smart-data-du...
[16:36:43] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207706
[16:36:45] <wikibugs>	 (03PS2) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552)
[16:36:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:47] <stashbot>	 T207706: LibreNMS upgrade to 1.49 - https://phabricator.wikimedia.org/T207706
[16:36:52] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207706 (duration: 00m 11s)
[16:36:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:07] <wikibugs>	 (03PS20) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[16:38:56] <wikibugs>	 (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1001/16222/cumin1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov)
[16:39:07] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational
[16:40:13] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:40:15] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "minor comments inline, otherwise LGTM" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe)
[16:42:37] <icinga-wm>	 PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 286 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/LibreNMS
[16:42:40] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[16:42:51] <wikibugs>	 (03PS5) 10Jcrespo: mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203)
[16:42:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[16:44:50] <wikibugs>	 (03PS6) 10Jcrespo: mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203)
[16:46:15] <icinga-wm>	 ACKNOWLEDGEMENT - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 286 bytes in 0.010 second response time Ayounsi Upgrading to 1.51 https://wikitech.wikimedia.org/wiki/LibreNMS
[16:46:31] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357
[16:46:37] <wikibugs>	 (03CR) 10CRusnov: "Some initial comments inline. Basically we need to get our crap together configuration-wise :)" (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507217 (owner: 10Ayounsi)
[16:47:54] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357
[16:47:57] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[16:49:01] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357
[16:50:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357 (owner: 10Arturo Borrero Gonzalez)
[16:50:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC is OK: https://puppet-compiler.wmflabs.org/compiler1002/16221/ even though it didn't finish because the compilation server run out of " [puppet] - 10https://gerrit.wikimedia.org/r/507357 (owner: 10Arturo Borrero Gonzalez)
[16:51:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov)
[16:51:46] <wikibugs>	 (03PS4) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359
[16:51:56] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "some more issues" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[16:52:25] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[16:52:26] <arturo>	 !log merging change to `profile::base` and `::raid` https://gerrit.wikimedia.org/r/c/operations/puppet/+/507357 related to T221225
[16:52:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:31] <stashbot>	 T221225: sssd integration needs to be updated to include sudo config from LDAP support - https://phabricator.wikimedia.org/T221225
[16:52:46] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov)
[16:54:09] <wikibugs>	 (03CR) 10Fsero: [C: 04-1] "LGTM overall, but please change the metric-config configmap name to include wmf.releasename also" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[16:55:55] <wikibugs>	 (03PS11) 10MacFan4000: Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694)
[16:56:07] <wikibugs>	 (03PS21) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072)
[16:57:36] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS
[16:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:41] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS (duration: 00m 09s)
[16:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:19] <icinga-wm>	 RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8737 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/LibreNMS
[16:58:56] <wikibugs>	 (03CR) 10BryanDavis: wikitech: Disable Gerrit accounts when blocked on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[16:59:50] <wikibugs>	 (03PS1) 10CRusnov: Add emacs ignores to gitignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/507370
[17:00:05] <jouncebot>	 cscott, arlolra, subbu, and halfak: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1700).
[17:00:22] <Zppix>	 All of them
[17:02:03] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] "What if the user is unblocked? Do we want users to create tasks on phabricator to ask to be unblocked from gerrit?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[17:03:20] <wikibugs>	 (03CR) 10Reedy: "Their phab account is probably disabled/blocked too. See the code above" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[17:06:04] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] "> Their phab account is probably disabled/blocked too. See the code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[17:06:06] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373
[17:06:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373 (owner: 10Jcrespo)
[17:07:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[17:07:37] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[17:08:27] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373
[17:09:33] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the quick fix! :)" [puppet] - 10https://gerrit.wikimedia.org/r/507373 (owner: 10Jcrespo)
[17:10:25] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373
[17:11:24] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373 (owner: 10Jcrespo)
[17:11:48] <logmsgbot>	 !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@1f09e44]: Update mobileapps to 142ba30 (T217837)
[17:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:55] <stashbot>	 T217837: [BUG] mobile-html article body has wrong background color - https://phabricator.wikimedia.org/T217837
[17:13:31] <wikibugs>	 (03Abandoned) 10Lucas Werkmeister (WMDE): Enable suggestion constraint status on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504389 (https://phabricator.wikimedia.org/T221107) (owner: 10Lucas Werkmeister (WMDE))
[17:15:47] <arturo>	 herron: you around?
[17:16:04] <logmsgbot>	 !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@1f09e44]: Update mobileapps to 142ba30 (T217837) (duration: 04m 16s)
[17:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:18] <herron>	 arturo: hey
[17:16:56] <arturo>	 herron: PCC jenkins worker nodes are out of disk space. Do you know how to handle that, or if I can simply rm -rf the output/ dir?
[17:17:36] <herron>	 arturo: sure I can do some cleanup
[17:17:48] <herron>	 just created a task about this yesterday as well
[17:17:53] <arturo>	 herron: cool thanks!
[17:19:21] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210 (10fsero)
[17:20:21] <volans>	 ottomata: is icinga fixed? I got sidetracked
[17:21:26] <herron>	 arturo: should be good to go now
[17:21:34] <ottomata>	 volans:  sorry was in meeting
[17:22:00] <arturo>	 herron: thanks!! 
[17:22:06] <herron>	 np!
[17:22:07] <ottomata>	 volans:  aren't all the comments misleading then?
[17:22:19] <ottomata>	 oh i see what you mean
[17:23:12] <wikibugs>	 (03PS3) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552)
[17:24:22] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[17:25:24] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[17:25:31] <wikibugs>	 (03PS4) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552)
[17:25:33] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata)
[17:32:22] <wikibugs>	 (03PS1) 10CRusnov: profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384
[17:33:10] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384 (owner: 10CRusnov)
[17:33:37] <icinga-wm>	 PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:34:03] <volans>	 ottomata: you didn't run puppet on icinga right?
[17:34:31] <ottomata>	 no
[17:34:33] <ottomata>	 running now
[17:38:55] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[17:39:05] <volans>	 \o/ thanks
[17:41:08] <wikibugs>	 (03PS1) 10CDanis: swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386
[17:41:20] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] wikitech: Disable Gerrit accounts when blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[17:41:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386 (owner: 10CDanis)
[17:43:59] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[17:45:29] <wikibugs>	 (03PS2) 10CDanis: swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386
[17:47:35] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[17:50:01] <wikibugs>	 (03PS2) 10CRusnov: profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384
[17:51:32] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[17:52:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez)
[17:53:27] <wikibugs>	 (03PS1) 10CRusnov: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390
[17:53:43] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[17:54:08] <wikibugs>	 (03CR) 10CDanis: "PCC looks fine https://puppet-compiler.wmflabs.org/compiler1002/16238/" [puppet] - 10https://gerrit.wikimedia.org/r/507386 (owner: 10CDanis)
[17:54:35] <wikibugs>	 (03PS1) 10Ayounsi: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706)
[17:56:48] <wikibugs>	 (03PS2) 10Ayounsi: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706)
[17:57:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "thanks for adding them! lgtm afaict. Volans is expert though" [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata)
[17:58:46] <wikibugs>	 (03CR) 10Volans: "I'm not familiar with Ganeti RAPI and how much time different calls take, but I'm not sure we need all this boilerplate and overhead to ju" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[17:58:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "perfect, thanks Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[17:59:02] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16242/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[17:59:07] <wikibugs>	 (03PS6) 10Dzahn: base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873)
[17:59:29] <wikibugs>	 (03PS3) 10CRusnov: profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384
[17:59:33] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[17:59:47] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Syntax wise they are correct." [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata)
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1800)
[18:01:18] <wikibugs>	 (03CR) 10CRusnov: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov)
[18:02:31] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[18:03:00] <Reedy>	 jouncebot: now
[18:03:00] <jouncebot>	 For the next 0 hour(s) and 56 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1800)
[18:03:02] <Reedy>	 jouncebot: next
[18:03:02] <jouncebot>	 In 0 hour(s) and 56 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1900)
[18:04:31] <icinga-wm>	 RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:04:43] <James_F>	 Reedy: You wanting to do something?
[18:04:59] <Reedy>	 James_F: The same thing I do every night
[18:05:05] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[18:06:19] <wikibugs>	 (03CR) 10CRusnov: "PCC output" [puppet] - 10https://gerrit.wikimedia.org/r/507384 (owner: 10CRusnov)
[18:08:34] <wikibugs>	 (03PS3) 10CDanis: swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386
[18:09:18] <thcipriani>	 !log start branchcut for 1.34.0-wmf.3
[18:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:28] <wikibugs>	 (03PS8) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[18:10:18] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386 (owner: 10CDanis)
[18:10:36] <wikibugs>	 (03CR) 10Dzahn: kafka: add icinga notes URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[18:11:05] <wikibugs>	 (03PS3) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873)
[18:11:07] <wikibugs>	 (03PS1) 10CRusnov: Add more emacs things to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/507393
[18:11:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[18:12:05] <icinga-wm>	 RECOVERY - Long running screen/tmux on ganeti2003 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[18:12:25] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.49 - https://phabricator.wikimedia.org/T207706 (10ayounsi) Added doc on how to upgrade LibreNMS https://wikitech.wikimedia.org/wiki/LibreNMS#Upgrade_LibreNMS
[18:12:39] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10ayounsi)
[18:14:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, maybe also designate a canary?" [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata)
[18:14:09] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[18:14:20] <wikibugs>	 (03PS1) 10CDanis: swift: codfw: bump replicate concurrency for decomm hosts [puppet] - 10https://gerrit.wikimedia.org/r/507396 (https://phabricator.wikimedia.org/T221068)
[18:14:59] <mutante>	 18:11:13 remote: aborting due to possible repository corruption on the remote side.
[18:15:02] <mutante>	 18:11:13 fatal: protocol error: bad pack header
[18:15:10] <mutante>	 hrmm.... CI 
[18:15:26] <cdanis>	 hmm
[18:15:28] <cdanis>	 are you sure it wasn't out of disk space on the local side?
[18:15:41] <chaomodus>	 erf
[18:15:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "One nit, but looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:16:03] <mutante>	 hmm. no. let me try it again
[18:16:15] <wikibugs>	 (03PS4) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873)
[18:16:27] <wikibugs>	 (03CR) 10Jbond: "Thanks for this i have been thinking of similar things in the back of my mind.  will give a proper review tomorrow. My initial comment is " (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto)
[18:17:42] <wikibugs>	 (03PS2) 10Rush: wikitech: Provision gerrit api auth credentials [puppet] - 10https://gerrit.wikimedia.org/r/506588 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[18:18:40] <mutante>	 yea, could not reproduce. works again
[18:18:42] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/16245/" [puppet] - 10https://gerrit.wikimedia.org/r/507396 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis)
[18:18:50] <wikibugs>	 (03CR) 10Rush: [C: 03+2] wikitech: Provision gerrit api auth credentials [puppet] - 10https://gerrit.wikimedia.org/r/506588 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)
[18:19:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn)
[18:19:49] <wikibugs>	 (03PS5) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873)
[18:20:41] <wikibugs>	 (03PS2) 10CDanis: swift: codfw: bump replicate concurrency for decomm hosts [puppet] - 10https://gerrit.wikimedia.org/r/507396 (https://phabricator.wikimedia.org/T221068)
[18:20:43] <wikibugs>	 (03CR) 10Ayounsi: Netbox: add j2nb support (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi)
[18:20:44] <icinga-wm>	 PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:21:49] <wikibugs>	 (03PS3) 10Ayounsi: Netbox: add j2nb support [puppet] - 10https://gerrit.wikimedia.org/r/507224
[18:22:01] <mutante>	 cdanis: you are first in line to merge
[18:22:05] <cdanis>	 I'm merging now
[18:22:10] <cdanis>	 puppet-merge'ing rather
[18:22:14] <mutante>	 submits
[18:22:25] <mutante>	 rebases :)
[18:22:28] <wikibugs>	 (03PS6) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873)
[18:22:40] <cdanis>	 I think I bumped into bd.808 as well
[18:22:40] <thcipriani>	 anyone know what's happening with mediawiki/extensions/JADE being renamed? currently breaking branch-cut since it wasn't updated in tools-release repo.
[18:23:04] <icinga-wm>	 PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:23:41] <paladox>	 thcipriani it's now at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Jade
[18:24:08] <cdanis>	 !log running puppet on ms-be2014 to bump replication concurrency T221068
[18:24:09] <thcipriani>	 is this in use? I don't know what's going to happen when we update localization
[18:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:12] <stashbot>	 T221068: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068
[18:24:14] <jbond42>	 cdanis: ^^ could that be related to your puppet-merge https://phabricator.wikimedia.org/T221529#5143984
[18:24:21] <paladox>	 thcipriani the other repo is archived
[18:24:24] <paladox>	 (read only)
[18:24:33] <paladox>	 aka https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/JADE
[18:24:54] <wikibugs>	 (03PS2) 10Dzahn: transparency report: allow members of LDAP 'nda' to see private site [puppet] - 10https://gerrit.wikimedia.org/r/506848 (https://phabricator.wikimedia.org/T221744)
[18:24:56] <Reedy>	 thcipriani: Looks like neither repo has any merges recently
[18:25:19] <Reedy>	 And 1.34.0-wmf.1 used JADE not Jade anyway
[18:25:34] <Reedy>	 Some documentation being all that got missed out
[18:25:35] <Reedy>	 https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Jade/+/5221a7a8ccf31a27870034de20073789a828b44e%5E%21/
[18:25:54] <icinga-wm>	 RECOVERY - Long running screen/tmux on notebook1003 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[18:26:27] <Reedy>	 Was it only recently marked readonly though?
[18:26:49] <paladox>	 i think so
[18:26:54] <paladox>	 there's a task for this
[18:27:01] <thcipriani>	 yeah https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/JADE/+/refs/meta/config%5E%21/#F0
[18:27:26] <paladox>	 https://phabricator.wikimedia.org/T221437
[18:27:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] transparency report: allow members of LDAP 'nda' to see private site [puppet] - 10https://gerrit.wikimedia.org/r/506848 (https://phabricator.wikimedia.org/T221744) (owner: 10Dzahn)
[18:27:57] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16246/netmon1002.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi)
[18:29:41] <thcipriani>	 I am going to unarchive JADE for the time being, make a new task about how to roll this out via train because I don't feel like it was done properly and I worry about the consequences to l10n.
[18:29:43] <wikibugs>	 (03PS3) 10Ayounsi: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706)
[18:29:47] <wikibugs>	 (03Abandoned) 10Ammarpad: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad)
[18:30:35] <wikibugs>	 (03CR) 10Ayounsi: "Thanks." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:31:04] <cdanis>	 jbond42: the catalog fetch fails? possibly
[18:31:14] <cdanis>	 jbond42: I get the sense there's been several merges in a row today
[18:32:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Dzahn) p:05Triage→03Normal
[18:35:21] <jbond42>	 cdanis: sorry i thought you missed my comment so i didn't follow up
[18:35:36] <jbond42>	 in this case i think it was just a coincidence, that box is having problems
[18:35:44] <cdanis>	 oh both the labweb ones?
[18:37:07] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata)
[18:37:13] <jbond42>	 i just checked 1002, was just waiting to see if another patch was coming in, figured someone was working on it
[18:37:14] <wikibugs>	 (03PS2) 10Ottomata: Add cumin aliasaes for schema* [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556)
[18:37:18] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add cumin aliasaes for schema* [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata)
[18:38:26] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "compared to the way PHP7.2 is installed for phabricator on stretch and looks all the same" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:39:12] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "> Is there anything else that needs to be done for the app uses PHP 7.2 or it will pickup the proper version automatically?" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:39:21] <jbond42>	 cdanis: cloud admin are looking into it
[18:40:12] <wikibugs>	 (03CR) 10Ottomata: Refactor eventgate-analytics to eventgate (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[18:40:50] <cdanis>	 !log running puppet on ms-be201[3,5] to bump replication concurrency T221068
[18:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:55] <stashbot>	 T221068: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068
[18:41:28] <icinga-wm>	 RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[18:42:29] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/16247/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:43:58] <icinga-wm>	 RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[18:48:18] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:48:28] <wikibugs>	 (03PS4) 10Dzahn: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:48:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:49:58] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:52:11] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[18:55:58] <wikibugs>	 (03PS7) 10Ottomata: Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346)
[18:57:59] <wikibugs>	 (03PS8) 10Ottomata: Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346)
[18:58:33] <wikibugs>	 (03CR) 10Ottomata: "Ok, instead of renaming and modifying eventgate-analytics, I've left eventgate-analytics in place until we are done migrating to the new e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[18:59:09] <wikibugs>	 (03PS1) 10BBlack: Convert most DYNA into CNAME to wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263)
[18:59:11] <wikibugs>	 (03PS1) 10BBlack: Change CNAME->DYNA TTLs from 1H to 1D [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263)
[19:00:04] <jouncebot>	 thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1900).
[19:00:14] * thcipriani works on it
[19:02:06] <wikibugs>	 (03PS9) 10Ottomata: Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346)
[19:02:50] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[19:03:31] <ottomata>	 thcipriani:  would love to test some things, let me know when you deploy to test wiki
[19:03:39] <thcipriani>	 ottomata: will do
[19:03:42] <ottomata>	 danke
[19:03:49] <thcipriani>	 might be a few though, running a bit behind :(
[19:10:18] <wikibugs>	 (03CR) 10Andrew Bogott: "The compiler found one false positive, but things look good:" [puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293) (owner: 10Andrew Bogott)
[19:10:46] <ottomata>	 s'ok
[19:12:28] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:15:43] <wikibugs>	 (03PS1) 10Dzahn: netmon: switch from PHP 7.0 to PHP 7.2 for LibreNMS upgrade [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706)
[19:19:26] <wikibugs>	 (03PS2) 10Dzahn: netmon: switch from PHP 7.0 to PHP 7.2 for LibreNMS upgrade [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706)
[19:21:17] <wikibugs>	 (03PS1) 10Ottomata: eventgate-analytics - Add cirrussearch-request to stream-config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/507403 (https://phabricator.wikimedia.org/T214080)
[19:21:30] <wikibugs>	 (03CR) 10Dzahn: "> Is there anything else that needs to be done for the app uses PHP 7.2 or it will pickup the proper version automatically?" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[19:22:01] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Add cirrussearch-request to stream-config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/507403 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata)
[19:22:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] netmon: switch from PHP 7.0 to PHP 7.2 for LibreNMS upgrade [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706) (owner: 10Dzahn)
[19:23:29] <wikibugs>	 (03CR) 10Dzahn: "You see this makes it easy to revert and go back to 7.0 while we don't have to revert your large patch and can have both packages at the s" [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706) (owner: 10Dzahn)
[19:24:34] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/analytics/eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging]
[19:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:45] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging]
[19:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:34] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics/eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging]
[19:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:36] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed
[19:25:36] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics finished
[19:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:48] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad]
[19:26:49] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed
[19:26:49] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics finished
[19:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:53] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw]
[19:27:55] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed
[19:27:55] <logmsgbot>	 !log otto@deploy1001 scap-helm eventgate-analytics finished
[19:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:37] <wikibugs>	 (03PS1) 10Thcipriani: Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405
[19:31:41] <icinga-wm>	 PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:34:03] <icinga-wm>	 RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[19:34:35] <thcipriani>	 looks like we've got a lot of old branches to clean up :\
[19:34:52] <ottomata>	 oh?
[19:35:36] <logmsgbot>	 !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@4360316]: Redeploy GUI for fixes T222133, T222129, T222181, T222182
[19:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:45] <stashbot>	 T222181: “Edit visually” broken on embed.htm - https://phabricator.wikimedia.org/T222181
[19:35:45] <stashbot>	 T222182: Short URL doesn’t work on embed.html - https://phabricator.wikimedia.org/T222182
[19:35:45] <stashbot>	 T222129: WDQS link back to WDQS from a rendered result doesn't show the SPARQL used to create the report - https://phabricator.wikimedia.org/T222129
[19:35:46] <stashbot>	 T222133: "Edit SPARQL" link is broken in embed.html - https://phabricator.wikimedia.org/T222133
[19:36:40] <thcipriani>	 yeah, scap clean was broken for a bit so we couldn't cleanup old branches as part of train for a while
[19:37:54] <thcipriani>	 now we have branches going back to February all taking up the space of a MW checkout + extensions + l10n :)
[19:38:04] <thcipriani>	 cleaning now
[19:38:05] <ottomata>	 oh my
[19:38:07] <ottomata>	 k
[19:38:10] <ottomata>	 (brb btw)
[19:38:11] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[19:40:11] <logmsgbot>	 !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.17 (duration: 10m 11s)
[19:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:42:11] <wikibugs>	 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Eevans) >>! In T211721#5145739, @aaron wrote: >  > [ ... ] >  > ... My understanding is...
[19:43:18] <wikibugs>	 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Eevans) >>! In T211721#5145107, @EvanProdromou wrote: > On a related note, do we want or...
[19:43:29] <mutante>	 !log switched netmon1002/netmon2001 from PHP 7.0 to 7.2 but reverted because LibreNMS still had an issue with it
[19:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:20] <logmsgbot>	 !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.18 (duration: 02m 25s)
[19:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:39] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:44:53] <logmsgbot>	 !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@4360316]: Redeploy GUI for fixes T222133, T222129, T222181, T222182 (duration: 09m 17s)
[19:45:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:01] <stashbot>	 T222181: “Edit visually” broken on embed.htm - https://phabricator.wikimedia.org/T222181
[19:45:01] <stashbot>	 T222182: Short URL doesn’t work on embed.html - https://phabricator.wikimedia.org/T222182
[19:45:02] <stashbot>	 T222129: WDQS link back to WDQS from a rendered result doesn't show the SPARQL used to create the report - https://phabricator.wikimedia.org/T222129
[19:45:02] <stashbot>	 T222133: "Edit SPARQL" link is broken in embed.html - https://phabricator.wikimedia.org/T222133
[19:47:35] <logmsgbot>	 !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.19 (duration: 02m 24s)
[19:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:49] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:51:19] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul)
[19:52:02] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul)
[19:53:24] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10Papaul)
[19:53:48] <ottomata>	 baack
[19:55:11] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2098-db2101 [puppet] - 10https://gerrit.wikimedia.org/r/507407 (https://phabricator.wikimedia.org/T220572)
[19:56:29] <logmsgbot>	 !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.20 (duration: 02m 07s)
[19:56:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:08] <logmsgbot>	 !log thcipriani@deploy1001 Started scap: testwiki to 1.34.0-wmf.3 and rebuild l10n cache
[19:58:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:56] <wikibugs>	 (03PS1) 10Ottomata: eventgate - include error_stream in default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/507409 (https://phabricator.wikimedia.org/T218346)
[19:59:30] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10Papaul)
[19:59:34] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - include error_stream in default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/507409 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[20:03:19] <wikibugs>	 (03PS1) 10Ottomata: eventgate - properly indent stream_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/507411 (https://phabricator.wikimedia.org/T218346)
[20:10:25] <wikibugs>	 (03PS2) 10Ottomata: eventgate - properly indent stream_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/507411 (https://phabricator.wikimedia.org/T218346)
[20:11:10] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - properly indent stream_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/507411 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata)
[20:21:30] <mutante>	 !log netmon1002 - loading PHP 7.2 module to debug issue for librenms. librenms very short downtime
[20:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 77.23, 34.66, 21.62
[20:25:51] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 52.05, 22.07, 14.45
[20:26:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 64.70, 35.23, 22.35
[20:26:16] <thcipriani>	 cdb rebuild step causing ^ ? maybe?
[20:26:39] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 51.10, 27.14, 18.63
[20:27:03] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 60.65, 29.06, 19.43
[20:27:07] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 60.90, 29.39, 19.31
[20:27:11] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 22.56, 19.89, 14.32
[20:27:11] <thcipriani>	 hrm, nope, at least not for 1285, just hhvm angry
[20:27:13] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 67.60, 32.98, 20.61
[20:27:15] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 23.00, 30.96, 22.53
[20:27:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 61.21, 28.27, 18.72
[20:27:33] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 27.35, 30.22, 21.60
[20:27:53] <cdanis>	 almost the whole appserver fleet had a big burst of network traffic
[20:27:55] <cdanis>	 in eqiad
[20:27:59] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 24.36, 24.16, 18.25
[20:28:19] <cdanis>	 and a somewhat-lagged burst of CPU load
[20:28:19] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 31.82, 27.75, 19.77
[20:28:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 28.27, 27.19, 19.43
[20:28:28] <chaomodus>	 bounced back pretty quick
[20:28:33] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 29.37, 29.19, 20.31
[20:28:35] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 26.44, 25.08, 18.39
[20:29:26] <logmsgbot>	 !log thcipriani@deploy1001 Finished scap: testwiki to 1.34.0-wmf.3 and rebuild l10n cache (duration: 31m 17s)
[20:29:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:35] <cdanis>	 lots of requests taking longer too
[20:29:48] <cdanis>	 https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-3h&to=now
[20:30:10] <ottomata>	 hm
[20:30:20] <chaomodus>	 blip, and then dropping down again
[20:30:31] <cdanis>	 the cpu / latency blip is not over
[20:30:47] <chaomodus>	 oh that's the % over 0.5s graph yeah
[20:30:58] <cdanis>	 the network blip does not correlate with thcipriani's deploy
[20:31:22] <thcipriani>	 should I pause train while you all investigate? Or keep going with group0? currently new version is on testwiki only.
[20:31:23] <chaomodus>	 did a world event just happen? :)
[20:31:32] <cdanis>	 chaomodus: frontend traffic graphs are not elevated
[20:31:43] <chaomodus>	 ah
[20:31:58] <ottomata>	 thcipriani:  testwiki seems to be very slow too?
[20:32:01] <cdanis>	 nor is qps observed at apache, so it isn't a difference in cache-ability at the CDN layer
[20:32:29] <thcipriani>	 ottomata: that's normal, takes a while for hhvm bytecode cache to warm up
[20:33:15] <cdanis>	 the delay from the deploy is really odd IMO
[20:33:45] <thcipriani>	 "a while" == "12 hits per server" or something like that, been a while since I looked into it
[20:34:05] <thcipriani>	 FWIW, I started seeing these when the cdb rebuild part of deployment started
[20:35:27] <cdanis>	 thcipriani: forgive my ignorance; what are the biggest users of CDB inside WMF wikis?
[20:35:38] <cdanis>	 probably the localization messages...?
[20:35:46] <thcipriani>	 cdanis: those are all the l10n messages for the wiki, yeah
[20:35:54] <thcipriani>	 "the wiki" == all wikis
[20:36:06] <cdanis>	 I guess interwiki links get a lot of hits but that db must be much much smaller
[20:37:25] <thcipriani>	 I feel like this is not dissimilar to https://phabricator.wikimedia.org/T204871
[20:37:50] <thcipriani>	 just the other side of that error
[20:38:10] <Reedy>	 cdanis: Interwiki links? they're not cdb...
[20:38:37] <cdanis>	 Reedy: that's not what https://www.mediawiki.org/wiki/CDB suggested, but that's all I know
[20:38:43] <Reedy>	 https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/interwiki.php
[20:38:50] <Reedy>	 !bug 1 | cdanis
[20:38:50] <wm-bot>	 cdanis: https://bugzilla.wikimedia.org/show_bug.cgi?id=1
[20:39:00] <Reedy>	 They've been php for 3 years now
[20:39:01] <Reedy>	 lol
[20:39:04] <cdanis>	 lol
[20:39:17] <Reedy>	 I guess, the page isn't technically incorrect
[20:39:22] <Reedy>	 I think, it can use cdb, but we don't
[20:39:47] <cdanis>	 I don't know MW codebase well enough to code-spelunk quickly
[20:40:07] <Reedy>	  * Interwiki cache, either as an associative array or a path to a constant
[20:40:07] <Reedy>	  * database (.cdb) file.
[20:40:20] <cdanis>	 ahh, amusingly, https://www.mediawiki.org/wiki/Interwiki_cache gets it right
[20:40:23] <cdanis>	 Until 2015, Wikimedia used it to configure the path to a CDB file that is loaded from disk when needed. Since 2016, it is also supported to set $wgInterwikiCache directly to an array. This is typically done by storing the array in a PHP file containing <?php return array( .. ); and loading it in the variable assignment with require.
[20:40:59] <Reedy>	 https://github.com/wikimedia/operations-mediawiki-config/commit/5bc3b88a0488e96b7473c7ceeb815b78ea5e9bb9#diff-be231341e8f4ecc1a4106d690593dac6
[20:42:10] <cdanis>	 interesting, logstash does not show an increase in fatals for the time interval in question here (20:23-20:30 was the worst of the latency spike)
[20:42:57] <wikibugs>	 (03PS3) 10Gehel: Enable revision fetches in production [puppet] - 10https://gerrit.wikimedia.org/r/504990 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev)
[20:43:46] <cdanis>	 I'm not sure if that increase in network traffic beforehand is usual for a deploy, either
[20:43:53] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] Enable revision fetches in production [puppet] - 10https://gerrit.wikimedia.org/r/504990 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev)
[20:44:17] <cdanis>	 but ah well, site seems fine, things look normal-ish now (appserver cluster is maybe using a hair more RAM than before, but not to an alarming degree) 
[20:45:06] <thcipriani>	 huh, we didn't get a 60 second timeout thing, that's...bizarre for a recent deployment? maybe? I guess it's been a few weeks since I did one.
[20:45:12] <cdanis>	 we did! but it was before that
[20:45:19] <thcipriani>	 oh
[20:45:25] <cdanis>	 19:40-19:45 or so
[20:46:15] <wikibugs>	 (03CR) 10Dzahn: "When switching to 7.2 in prod librenms would fail to work. After some debugging (had to turn on error_reporting etc to see what caused the" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[20:46:24] <cdanis>	 er, 19:30-ish for the start
[20:46:26] <thcipriani>	 ah
[20:46:30] <cdanis>	 which matches up better with your deploy
[20:46:58] <thcipriani>	 pruning wikiversions does sync some code: https://tools.wmflabs.org/sal/log/AWpvwwDJOwpQ-3Pku3J5
[20:47:03] <thcipriani>	 rsync --delete
[20:47:25] <cdanis>	 oh and that will also clear the HHVM bytecode cache?
[20:47:42] <thcipriani>	 unsure
[20:47:52] <thcipriani>	 it hasn't in the past afaicr
[20:48:21] <Reedy>	 for old enough versions of MW, i can't imagine the hhvm bytecode cache still references them
[20:48:30] <thcipriani>	 should mostly be stat-ing a bunch of files and the files its removing shouldn't be in the bytecode ca...yeah ^
[20:48:52] <cdanis>	 yeah
[20:48:54] <cdanis>	 hm
[20:48:56] <cdanis>	 strange.
[20:49:36] <wikibugs>	 (03PS1) 10Dzahn: librenms: ensure php7.2-ldap is installed [puppet] - 10https://gerrit.wikimedia.org/r/507487 (https://phabricator.wikimedia.org/T207706)
[20:50:19] <wikibugs>	 (03PS3) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504)
[20:50:25] <cdanis>	 the differences in network traffic patterns on the appservers are a bit odd to me
[20:50:46] <wikibugs>	 (03PS2) 10Dzahn: librenms: ensure php7.2-ldap is installed [puppet] - 10https://gerrit.wikimedia.org/r/507487 (https://phabricator.wikimedia.org/T207706)
[20:50:52] <chaomodus>	 how so?
[20:50:56] <cdanis>	 https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&from=now-1h&to=now and flop open 'network per host'
[20:51:20] <cdanis>	 most of them rx'd more than they tx'd, but a few of them had a big spike of tx
[20:51:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] librenms: ensure php7.2-ldap is installed [puppet] - 10https://gerrit.wikimedia.org/r/507487 (https://phabricator.wikimedia.org/T207706) (owner: 10Dzahn)
[20:52:11] <chaomodus>	 like mw1251
[20:52:26] <chaomodus>	 like a proportionally huge spike
[20:53:14] <chaomodus>	 like all of them show the same pattern except for two with the tx spikes
[20:53:16] <chaomodus>	 neat
[20:53:18] <chaomodus>	 that is weird
[20:53:37] <cdanis>	 three -- mw1251, mw1268, mw1320
[20:53:51] <chaomodus>	 ah missed that last one
[20:53:52] <cdanis>	 which started at approx the same time, 20:15
[20:57:31] <chaomodus>	 it's funny how much larger those spikes are than the event rx spikes
[20:57:41] <thcipriani>	 so looking at the logstash scap dashboard merge-cdb-updates finished up close to the time of the end of the spike on 1251: Updated 417 CDB files(s) in /srv/mediawiki/php-1.34.0-wmf.3/cache/l10n 2019-04-30T20:26:18
[20:59:05] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:00:27] <wikibugs>	 (03PS1) 10Clarakosi: Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218)
[21:03:38] <mutante>	 !log librenms - switched from PHP 7.0 to PHP 7.2 successful now. reverted manual changes for debugging on netmon1002
[21:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:52] <mutante>	 !log netmon2001 -  apt-get remove --purge php7.0*
[21:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:19] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi)
[21:05:44] <thcipriani>	 cdanis: chaomodus still digging? can I go ahead with 1.34.0-wmf.3 to group0?
[21:05:55] <cdanis>	 oh, no, sorry, I think it's fine
[21:05:58] <cdanis>	 proceed
[21:06:06] <chaomodus>	 Yah it seems okay from my perspective
[21:06:07] <mutante>	 !log netmon2001 -  apt-get install php-common php-pear (pending upgrades)
[21:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:10] <thcipriani>	 thanks :)
[21:06:11] <chaomodus>	 just was playing with graphs a bit :)
[21:06:12] <cdanis>	 there's some strange stuff here but nothing alarming
[21:06:19] <thcipriani>	 agreed
[21:06:27] <cdanis>	 and it's not like I actually know anything about the appservers or scap in the first place 🙃
[21:07:47] <chaomodus>	 I'm not seeing how these things could correlate
[21:07:51] <chaomodus>	 but yah same
[21:08:11] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405 (owner: 10Thcipriani)
[21:09:24] <wikibugs>	 (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405 (owner: 10Thcipriani)
[21:10:28] <mutante>	 !log netmon1002 - apt-get remove --purge php 7.0* ; apt-get install php-common php-pear (pending upgrades) | netmon2001: apt autoremove
[21:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:59] <wikibugs>	 (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/507487  working now!  also i removed the 7.0 packages and cleaned up" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[21:13:33] <logmsgbot>	 !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.3
[21:13:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:03] <wikibugs>	 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10EvanProdromou) >>! In T211721#5148423, @Eevans wrote: >>>! In T211721#5145107, @EvanProd...
[21:15:07] <thcipriani>	 ottomata: FYI, 1.34.0-wmf.3 live on group0
[21:15:27] <ottomata>	 thank you!
[21:15:33] <ottomata>	 its working
[21:24:45] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10Dzahn) - [[ https://gerrit.wikimedia.org/r/507391 | first change ]] installed the 7.2 packages but did not change which Apache module was loaded, so still used 7.0 while both packages we...
[21:24:59] <wikibugs>	 (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405 (owner: 10Thcipriani)
[21:25:22] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Andrew) *bump*  I still need something like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ in order to get cloudvirt1024 online (and to pave the way towards...
[21:31:41] <wikibugs>	 10Operations, 10ops-codfw: wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10Dzahn) p:05Triage→03Normal
[21:32:40] <wikibugs>	 10Operations, 10Domains, 10Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10Dzahn) p:05Triage→03Normal
[21:33:30] <wikibugs>	 10Operations, 10observability, 10Wikimedia-Incident: prometheus: upgrade to 2.9.2 - https://phabricator.wikimedia.org/T222113 (10Dzahn) p:05Triage→03Normal
[21:34:21] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[21:37:39] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10Dzahn) p:05Triage→03High
[21:38:03] <wikibugs>	 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10Dzahn) p:05Triage→03High
[21:38:33] <wikibugs>	 10Operations, 10observability, 10Wikimedia-Incident: prometheus: some sort of IRC alerts on restarts? - https://phabricator.wikimedia.org/T222108 (10Dzahn) p:05Triage→03Normal
[21:40:27] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10Dzahn) p:05Triage→03Normal
[21:40:41] <wikibugs>	 10Operations, 10observability, 10Wikimedia-Incident: prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) - https://phabricator.wikimedia.org/T222102 (10Dzahn) p:05Triage→03Normal
[21:41:00] <wikibugs>	 10Operations, 10cloud-services-team: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10Dzahn) p:05Triage→03High
[21:42:32] <wikibugs>	 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10EvanProdromou) >>! In T211721#5009838, @aaron wrote:  > I think 10x of a normal SET in a...
[21:43:51] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:44:06] <sbassett>	 !log Deployed patch for T222036 (1.34.0-wmf.1 and 1.34.0-wmf.3)
[21:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:19] <sbassett>	 !log Deployed patch for T222038 (1.34.0-wmf.1 and 1.34.0-wmf.3)
[21:44:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:07] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:49:42] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs: Remove puppet code for the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293)
[21:49:44] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack: Update firewall defines to remove references to things in ::main:: [puppet] - 10https://gerrit.wikimedia.org/r/507505
[21:49:46] <wikibugs>	 (03PS1) 10Andrew Bogott: labtest/codfw-dev: remove some dangling references to the main region [puppet] - 10https://gerrit.wikimedia.org/r/507506
[21:49:48] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs: update or remove some old references to the main region [puppet] - 10https://gerrit.wikimedia.org/r/507507
[21:49:50] <wikibugs>	 (03PS1) 10Andrew Bogott: prometheus: update references to the no-longer-existing 'main' deploy [puppet] - 10https://gerrit.wikimedia.org/r/507508
[21:49:52] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs: remove hiera references to the now-deleted main deploy [puppet] - 10https://gerrit.wikimedia.org/r/507509
[21:51:48] <wikibugs>	 (03PS1) 10CRusnov: Update requirements and artifacts for Netbox v2.5.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507510
[21:52:58] <logmsgbot>	 !log mobrovac@deploy1001 Started deploy [restbase/deploy@b3b140f] (dev-cluster): Parsoid: use the new stashing tables for old revisions too
[21:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:20] <logmsgbot>	 !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b3b140f] (dev-cluster): Parsoid: use the new stashing tables for old revisions too (duration: 03m 22s)
[21:56:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:31] <logmsgbot>	 !log mobrovac@deploy1001 Started deploy [restbase/deploy@b3b140f]: Parsoid: Use the new stash tables for old revisions - T215956
[21:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:36] <stashbot>	 T215956: Consider stashing data-parsoid for VE  - https://phabricator.wikimedia.org/T215956
[22:04:20] <wikibugs>	 (03CR) 10Mobrovac: [C: 04-1] "LGTM, one comment in-lined." (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi)
[22:21:28] <logmsgbot>	 !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b3b140f]: Parsoid: Use the new stash tables for old revisions - T215956 (duration: 23m 56s)
[22:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:32] <stashbot>	 T215956: Consider stashing data-parsoid for VE  - https://phabricator.wikimedia.org/T215956
[22:47:48] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) We had a hangout and chat and talked about the remaining things and agreed they are currently not needed. If Willy is blocked by anything we will revisit.
[22:53:17] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) 05Open→03Resolved
[22:56:12] <wikibugs>	 (03CR) 10CRusnov: "I have tested this on af-netbox." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507510 (owner: 10CRusnov)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, and Niharika: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:00:59] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:07:28] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481
[23:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:33] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481 (duration: 00m 05s)
[23:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:10] <wikibugs>	 (03PS2) 10Paladox: Merge tag 'v2.15.13' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/505801
[23:13:22] <wikibugs>	 (03PS1) 10Paladox: Update plugins to 2.15.13 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/507521
[23:14:46] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10mobrovac) >>! In T220402#5146064, @Pablo-WMDE wrote: > @mobrovac During T221755 & T221754 we tended to [[ https://ssr-termbox.w...
[23:16:11] <wikibugs>	 (03PS1) 10Dzahn: admins: add Joel Aufrecht to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/507522 (https://phabricator.wikimedia.org/T222214)
[23:16:21] <wikibugs>	 (03PS4) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504)
[23:16:24] <wikibugs>	 (03PS2) 10Paladox: Update plugins to 2.15.13 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/507521
[23:17:23] <icinga-wm>	 PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 287 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/LibreNMS
[23:18:08] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS
[23:18:10] <wikibugs>	 (03PS2) 10Dzahn: admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076)
[23:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:13] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS (duration: 00m 05s)
[23:18:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:41] <icinga-wm>	 RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8737 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/LibreNMS
[23:19:57] <wikibugs>	 (03PS1) 10Paladox: Remove quota plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/507523
[23:22:53] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10ayounsi) I tried to deploy it once again.  1/ They replaced log_file with log_dir, this will need a puppet change I temporarily worked around it but:  2/ App is not loading and this is s...
[23:30:11] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:30:11] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:30:25] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:31:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:31:19] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:31:19] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:34:19] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504) (owner: 10ArielGlenn)
[23:35:03] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[23:35:40] <logmsgbot>	 !log ariel@deploy1001 Started deploy [dumps/dumps@d715ea0]: determine page ranges of content output files by cumul revision length as well as rev count
[23:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:43] <logmsgbot>	 !log ariel@deploy1001 Finished deploy [dumps/dumps@d715ea0]: determine page ranges of content output files by cumul revision length as well as rev count (duration: 00m 03s)
[23:35:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:51] <apergos>	 deploy and sleep, that's me
[23:36:52] <paladox>	 it's exciting to see gerrit's metrics showing improvements since yesterday! (threads are lower than they have been for the whole month!)
[23:37:07] <apergos>	 (but the job won't start until tomorrow 11 am my time so it's all good)
[23:37:38] <wikibugs>	 (03PS1) 10Papaul: DNS: Remoce mgmt and production DNS for db2014,db2020,db2021,db2022,db2024,db2031 [dns] - 10https://gerrit.wikimedia.org/r/507525
[23:43:30] <wikibugs>	 (03PS1) 10Ayounsi: LibreNMS, add log dir [puppet] - 10https://gerrit.wikimedia.org/r/507526 (https://phabricator.wikimedia.org/T207706)
[23:46:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] LibreNMS, add log dir [puppet] - 10https://gerrit.wikimedia.org/r/507526 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[23:47:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16251/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/507526 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[23:48:59] <mutante>	 paladox: very nice :)
[23:49:06] <mutante>	 bbiaw
[23:49:06] <paladox>	 yup :)
[23:49:49] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481
[23:49:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:54] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481 (duration: 00m 04s)
[23:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:39] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS
[23:56:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:44] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS (duration: 00m 05s)
[23:56:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:57] <Praxidicae>	 i keep getting this error every time i try to open wikipedia from google
[23:58:04] <Praxidicae>	 Request from 2601:14a:c201:2d54:a42a:38c9:9ec7:44e1 via cp1085 cp1085, Varnish XID 866091559 Error: 503, Backend fetch failed at Tue, 30 Apr 2019 23:57:35 GMT
[23:58:47] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:58:49] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:58:49] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:58:55] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:58:59] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:59:05] <paladox>	 i guess that must be it ^^
[23:59:15] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:59:41] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:59:43] <icinga-wm>	 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[23:59:47] <Reedy>	 mutante: XioNoX ^