[00:01:36] (PS1) Bstorm: cloudstore: add to the script for syncserver [puppet] - https://gerrit.wikimedia.org/r/507227 (https://phabricator.wikimedia.org/T209527)
[00:02:43] (CR) Bstorm: [C: +2] cloudstore: add to the script for syncserver [puppet] - https://gerrit.wikimedia.org/r/507227 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:06:38] (PS3) CDanis: icinga: pause nsca on reloads [puppet] - https://gerrit.wikimedia.org/r/504898 (https://phabricator.wikimedia.org/T196336)
[00:07:19] (CR) CDanis: [C: +2] icinga: pause nsca on reloads [puppet] - https://gerrit.wikimedia.org/r/504898 (https://phabricator.wikimedia.org/T196336) (owner: CDanis)
[00:12:51] (PS1) Bstorm: cloudstore: add to role for the syncing [puppet] - https://gerrit.wikimedia.org/r/507229 (https://phabricator.wikimedia.org/T209527)
[00:13:44] (PS1) Dzahn: admins: add Harumi Monroy to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/507230 (https://phabricator.wikimedia.org/T222110)
[00:14:03] (PS1) Ayounsi: Add fake ssh keys for netbox network user [labs/private] - https://gerrit.wikimedia.org/r/507231
[00:14:08] (CR) Bstorm: [C: +2] cloudstore: add to role for the syncing [puppet] - https://gerrit.wikimedia.org/r/507229 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:14:13] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 101 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, number_of_in_flight_fetch: 0, number_of_data_nodes: 4, timed_out: False, unassigned_shards: 101, number_of_pending_tasks: 0, relocating_shards: 0, initializing_shards: 0, number_of_nodes: 4, status: red, delayed_unassigned_shards: 0, act
[00:14:13] t_as_number: 82.91032148900169, active_primary_shards: 182, active_shards: 490, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:14:17] (CR) BryanDavis: wikitech: Disable Gerrit accounts when blocked on wikitech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[00:14:29] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 101 threshold =0.15 breach: delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, active_primary_shards: 182, status: red, timed_out: False, number_of_pending_tasks: 0, cluster_name: cloudelastic-chi-eqiad, number_of_data_nodes: 4, number_of_nodes: 4, number_of_in_flight_fetch: 0, initiali
[00:14:29] ctive_shards: 490, relocating_shards: 0, unassigned_shards: 101, active_shards_percent_as_number: 82.91032148900169 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:16:23] (CR) Dzahn: [C: +2] admins: add Harumi Monroy to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/507230 (https://phabricator.wikimedia.org/T222110) (owner: Dzahn)
[00:16:31] (PS2) Dzahn: admins: add Harumi Monroy to ldap_only_admins [puppet] - https://gerrit.wikimedia.org/r/507230 (https://phabricator.wikimedia.org/T222110)
[00:17:04] (PS1) Bstorm: cloudstore: fix one more mistake in syncserver [puppet] - https://gerrit.wikimedia.org/r/507232 (https://phabricator.wikimedia.org/T209527)
[00:17:07] cloudelastic alerts are expected, it's in the middle of creating all the indices for all the wikis
[00:17:59] (PS2) Ayounsi: Netbox: add j2nb support [puppet] - https://gerrit.wikimedia.org/r/507224
[00:18:48] (PS2) Bstorm: cloudstore: fix one more mistake in syncserver [puppet] - https://gerrit.wikimedia.org/r/507232 (https://phabricator.wikimedia.org/T209527)
[00:19:24] (CR) Bstorm: [C: +2] cloudstore: fix one more mistake in syncserver [puppet] - https://gerrit.wikimedia.org/r/507232 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:21:49] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1003 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: timed_out: False, number_of_nodes: 4, active_shards_percent_as_number: 85.3568800588668, relocating_shards: 0, status: red, active_primary_shards: 422, cluster_name: cloudelastic-chi-eqiad, active_shards: 1160, number_of_data_nodes: 4, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0
[00:21:49] ds: 199, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:22:37] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1001 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: active_shards: 1247, unassigned_shards: 199, initializing_shards: 0, number_of_in_flight_fetch: 0, relocating_shards: 0, number_of_nodes: 4, timed_out: False, status: red, active_primary_shards: 451, active_shards_percent_as_number: 86.23789764868603, number_of_data_nodes: 4, number_of_pending_task
[00:22:37] e: cloudelastic-chi-eqiad, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:50:16] Operations, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Dzahn)
[00:57:24] Operations, Phabricator, Project-Admins, SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc - https://phabricator.wikimedia.org/T221112 (Dzahn)
[00:58:49] Operations, Phabricator, Project-Admins, SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc - https://phabricator.wikimedia.org/T221112 (Dzahn) >>! In T221112#5121789, @mmodell wrote: @Dzahn can you help me figure out how to allow @aklapper to run...
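Note on the cloudelastic shard alerts above (00:14 to 00:22): the figures in the alert text are consistent with a check that compares the fraction of inactive shards against the 0.15 threshold, which would explain why the check recovers while the cluster status is still red. The sketch below only reproduces that arithmetic from the numbers in the log. The function names are hypothetical, and the comparison rule (inactive fraction at or above the threshold counts as a breach) is an assumption, not the actual Icinga plugin logic.

```python
# Minimal sketch: recompute the inactive-shard fraction from the alert text.
# Assumption (not confirmed by the log): the check breaches when the inactive
# fraction is at or above the threshold printed in the alert (0.15).

THRESHOLD = 0.15  # taken from "threshold =0.15" in the alert message


def inactive_shard_fraction(active_shards, unassigned_shards, initializing_shards=0):
    """Return the share of shards that are not active (0.0 for an empty cluster)."""
    total = active_shards + unassigned_shards + initializing_shards
    if total == 0:
        return 0.0
    return (unassigned_shards + initializing_shards) / total


def is_breach(active_shards, unassigned_shards, initializing_shards=0, threshold=THRESHOLD):
    """Hypothetical rule: breach when the inactive fraction reaches the threshold."""
    return inactive_shard_fraction(active_shards, unassigned_shards, initializing_shards) >= threshold


# (timestamp, active_shards, unassigned_shards) as reported in the log
samples = [
    ("00:14:13", 490, 101),
    ("00:21:49", 1160, 199),
    ("00:22:37", 1247, 199),
]

for stamp, active, unassigned in samples:
    fraction = inactive_shard_fraction(active, unassigned)
    state = "CRITICAL" if is_breach(active, unassigned) else "OK"
    print(f"{stamp} inactive={fraction:.4f} active%={(1 - fraction) * 100:.2f} -> {state}")
```

Under this reading, the unassigned count actually grows from 101 to 199 while indices are being created, but the active count grows faster, so the inactive share drops below the threshold.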
[01:02:57] Operations, Phabricator, Project-Admins, SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (Dzahn)
[01:12:59] (CR) Dzahn: [C: +1] Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: Aklapper)
[01:22:33] Operations, Analytics, Analytics-Kanban, EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Dzahn) When adding a new type of server name please add them in these 2 places: - wikitech https://wikitech.wikimedia.org/wiki/In...
[01:25:11] (PS5) Dzahn: zuul: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507070 (https://phabricator.wikimedia.org/T218844) (owner: Paladox)
[01:27:15] (CR) Dzahn: [C: +2] zuul: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507070 (https://phabricator.wikimedia.org/T218844) (owner: Paladox)
[01:30:00] !log contint2001..then contint1001 - deleting /etc/zuul/wikimedia and letting puppet re-clone it (gerrit:507070) (T218844)
[01:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:04] T218844: Update Gerrit /r/p/ links to /r/ - https://phabricator.wikimedia.org/T218844
[01:35:07] Operations, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Dzahn)
[01:35:32] Operations, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Dzahn)
[01:44:15] RECOVERY - Host db1093 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[01:46:56] PROBLEM - mysqld processes on db1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:47:39] PROBLEM - MariaDB read only s6 on db1093 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:48:00] PROBLEM - MariaDB Slave IO: s6 on db1093 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:48:04] uhmm.. i got paged because of that
[01:48:06] PROBLEM - MariaDB Slave SQL: s6 on db1093 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:48:15] but also i think it's normal that they dont start up after reboot
[01:53:35] I do believe DBs not starting up on boot is normal and expected
[01:53:43] (PS1) Dzahn: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237
[01:53:53] yes, it is. but i think we are still supposed to depool it
[01:53:59] ?
[01:54:16] I would assume so
[01:54:53] (CR) Gergő Tisza: wikitech: Disable Gerrit accounts when blocked on wikitech (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[01:54:56] (CR) Cwhite: [C: +1] depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[01:56:44] PROBLEM - MariaDB Slave Lag: s6 on db1093 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:01:36] (CR) Gergő Tisza: [C: +2] depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:02:03] ^ doing an emergency mw-config deploy
[02:02:06] thanks tgr!
[02:02:20] making a DBA ticket for it
[02:02:44] (Merged) jenkins-bot: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:04:34] mutante: I suppose this is not testable on mwdebug?
[02:06:33] tgr: no, i don't think so. but we have "Once the change is deployed, we should be able to see our change on: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php "
[02:06:59] well..that is after the fact
[02:07:40] this is the official example from the docs how a diff should look:
[02:07:41] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/447984/1/wmf-config/db-eqiad.php
[02:09:34] !log tgr@deploy1001 Synchronized wmf-config/db-eqiad.php: SWAT: [[gerrit:507237|depool db1093]] (duration: 00m 54s)
[02:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:13] (CR) jenkins-bot: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:10:31] PROBLEM - HP RAID on db1093 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:10:33] ACKNOWLEDGEMENT - HP RAID on db1093 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T222128 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:10:38] Operations, ops-eqiad: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (ops-monitoring-bot)
[02:10:44] and that RAID issue is why it went down.. i think
[02:10:55] there were lots of alerts in SOFT state earlier
[02:11:13] now it finally went from SOFT to HARD
[02:12:23] i DO see the changes now on https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php confirmed
[02:13:49] connection errors are gone
[02:13:55] tgr: :) thanks
[02:13:56] Operations, ops-eqiad: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Dzahn) This host started paging us for being rebooted a little before this ticket has been created. It is already depooled. -> T222127
[02:15:20] ACKNOWLEDGEMENT - MariaDB Slave IO: s6 on db1093 is CRITICAL: CRITICAL slave_io_state could not connect daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:15:21] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on db1093 is CRITICAL: CRITICAL slave_sql_lag could not connect daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:15:22] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on db1093 is CRITICAL: CRITICAL slave_sql_state could not connect daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:15:22] ACKNOWLEDGEMENT - MariaDB read only s6 on db1093 is CRITICAL: Could not connect to localhost:3306 daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:15:23] ACKNOWLEDGEMENT - mysqld processes on db1093 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld daniel_zahn depooled - https://phabricator.wikimedia.org/T222127 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:16:41] Operations, ops-eqiad, DBA: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Dzahn)
[02:17:56] (CR) Dzahn: "https://phabricator.wikimedia.org/T222127 , https://phabricator.wikimedia.org/T222128" [mediawiki-config] - https://gerrit.wikimedia.org/r/507237 (owner: Dzahn)
[02:19:43] Operations, ops-eqiad: kafka1023 correctable memory errors - https://phabricator.wikimedia.org/T194249 (Dzahn) showing up in icinga as a new issue: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka1023&service=Memory+correctable+errors+-EDAC-
[02:20:04] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on kafka1023 is CRITICAL: 4.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T194249 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1023&var-datasource=eqiad+prometheus/ops
[02:21:29] back at an actual computer now, anything I can help with shdubsh mutante tgr ?
[02:21:59] cdanis: thank you, i think it's done. followed the DBA docs to depool it and make a ticket
[02:22:17] 👍
[02:22:21] RECOVERY - Check systemd state on analytics1050 is OK: OK - running: The system is fully operational
[02:23:29] !log analytics1050 - systemctl start mclog ... it was failed like recently on analytics1052 (T212219 ?)
[02:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:34] T212219: wmf-auto-restart fails on certain legacy services - https://phabricator.wikimedia.org/T212219
[02:24:02] and with that Icinga looks clean again and i'm out
[02:25:03] cya tomorrow
[02:29:07] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[02:29:15] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[02:30:23] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:31:53] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1166 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:33:01] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:34:29] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.21 ms
[02:34:37] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[02:39:33] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:41:15] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153)
[02:42:11] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:42:35] (CR) Zoranzoki21: [C: +1] Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: Ammarpad)
[02:43:33] (CR) Zoranzoki21: [C: +1] "Should be ok..." [mediawiki-config] - https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: DannyS712)
[02:46:39] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms
[02:50:40] (CR) Zoranzoki21: [C: +1] Add 'Author' namespace in Sanskrit Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: Ammarpad)
[02:51:27] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16912696 and 0 seconds
[02:52:47] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 222896 and 71 seconds
[03:10:39] PROBLEM - puppet last run on conf1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:26:44] Operations, TechCom, Core Platform Team (Session Management Service (CDP2)), Core Platform Team Backlog (Later), and 6 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (aaron) >>! In T211721#5009838, @aaron wrote: > The SET metric for redis is very slow, so...
[03:42:29] RECOVERY - puppet last run on conf1005 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[03:50:19] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:55:29] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:55:31] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational
[03:55:55] PROBLEM - puppet last run on analytics1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:03:13] PROBLEM - puppet last run on mw1309 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:22:29] RECOVERY - puppet last run on analytics1070 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:26:25] !log LDAP - remove user pirroh from group nda (T222085 and cross-validate-accounts demands consistency)
[04:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:29] T222085: Revoke @pirroh's shell access - https://phabricator.wikimedia.org/T222085
[04:27:23] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:28:32] (CR) Giuseppe Lavagetto: [C: +1] registryha: feat: introducing LVS configuration [puppet] - https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: Fsero)
[04:30:33] PROBLEM - puppet last run on rdb1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:31:59] Operations, Wikimedia-Site-requests, acl*stewards: Create accounts for new stewards in closed wikis - https://phabricator.wikimedia.org/T222117 (kolbert) I agree that there should be some established procedure for cases where it becomes apparent action needs to be taken on a closed wiki. @Base are...
[04:34:26] (PS3) Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946
[04:34:28] (PS2) Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578
[04:34:30] (PS3) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947
[04:34:39] (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[04:35:03] RECOVERY - puppet last run on mw1309 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:37:41] (PS4) Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946
[04:37:43] (PS3) Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578
[04:37:45] (PS4) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947
[04:41:15] (CR) jerkins-bot: [V: -1] confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[04:41:33] (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[04:41:37] (CR) jerkins-bot: [V: -1] confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578 (owner: Giuseppe Lavagetto)
[04:44:17] (CR) Giuseppe Lavagetto: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[04:48:57] (CR) jerkins-bot: [V: -1] confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[04:49:13] <_joe_> uhm I really don't get this
[04:51:03] <_joe_> and this was the rebase, indeed
[04:53:52] The tests won't run
[04:54:08] Volans said something about fixing it yesterday
[04:54:46] Had the same issue last week
[04:57:01] RECOVERY - puppet last run on rdb1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:58:14] <_joe_> how did this happen, I could dig deeper
[05:02:11] <_joe_> oh I see now running tests locally
[05:04:52] Operations, ops-eqiad, DBA: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Marostegui) Per that output, looks like the BBU is gone, let's follow the investigation at {T222127}
[05:04:53] (PS1) Marostegui: db1093: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/507241 (https://phabricator.wikimedia.org/T222127)
[05:05:07] Operations, ops-eqiad, DBA: Degraded RAID on db1093 - https://phabricator.wikimedia.org/T222128 (Marostegui)
[05:08:27] (CR) Marostegui: [C: +2] db1093: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/507241 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:18:12] Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui)
[05:22:00] (PS1) Marostegui: db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127)
[05:24:04] (CR) Marostegui: [C: +2] db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:24:37] (PS1) Marostegui: db2045: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507243 (https://phabricator.wikimedia.org/T219493)
[05:25:07] (Merged) jenkins-bot: db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:26:27] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify db1093's status (duration: 00m 55s)
[05:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1093's status (duration: 00m 51s)
[05:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:41] Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui) I have started MySQL which started correctly. As it started fine, I have started replication too, once it has caught up, I am going to do a da...
[05:35:01] (CR) jenkins-bot: db-eqiad.php: Clarify db1093 status [mediawiki-config] - https://gerrit.wikimedia.org/r/507242 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:38:21] (PS3) Elukey: admin: allow analytics-admins to control jupyter user units [puppet] - https://gerrit.wikimedia.org/r/504067
[05:40:08] (CR) Elukey: [C: +2] admin: allow analytics-admins to control jupyter user units [puppet] - https://gerrit.wikimedia.org/r/504067 (owner: Elukey)
[05:40:27] (CR) Marostegui: [C: +2] db2045: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507243 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[05:40:29] (CR) Elukey: [C: +2] "This has been approved by the SRE team meeting." [puppet] - https://gerrit.wikimedia.org/r/504067 (owner: Elukey)
[05:40:34] (PS2) Marostegui: db2045: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507243 (https://phabricator.wikimedia.org/T219493)
[05:41:38] elukey: good to merge your change?
[05:42:30] yep!
[05:42:33] thanks :)
[05:42:45] merging!
[06:29:53] PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-puppet-agent]
[06:30:59] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:45:56] (PS3) Matthias Mullie: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734)
[06:56:11] RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:16] Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui) For what is worth, the LB looks like it worked fine. The time line is: 23:24: db1093 goes down 23:24-23:30: Spike of errors and then some res...
[06:57:19] RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:24:08] !log Remove labservices1001 and labservices1002 from tendril T221857
[07:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:12] T221857: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857
[07:29:06] !log installing systemd updates for jessie
[07:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:29] Operations, SRE-Access-Requests: Allow analytics-admins to control jupyter user units - https://phabricator.wikimedia.org/T222087 (elukey) Open→Resolved p:Triage→Normal
[07:39:45] Operations, Research, SRE-Access-Requests, Patch-For-Review: Revoke @pirroh's shell access - https://phabricator.wikimedia.org/T222085 (MoritzMuehlenhoff) @RStallman-legalteam : The access for Michele Cataste has been removed, can you please also update the NDA tracking meta data?
[07:41:08] (PS1) Elukey: Add the dfs.namenode.handler.count HDFS option [puppet/cdh] - https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702)
[07:45:16] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16179/an-master1001.eqiad.wmnet/" [puppet/cdh] - https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[07:46:06] Operations, Research, SRE-Access-Requests, Patch-For-Review: Revoke @pirroh's shell access - https://phabricator.wikimedia.org/T222085 (MoritzMuehlenhoff) @Dzahn Best to use https://wikitech.wikimedia.org/wiki/Ops_Offboarding#Remove_user_from_privileged_groups ; it will prepare an LDIF to drop al...
[07:50:06] Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, Performance-Team (Radar): Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) It seems to be breaking navtiming (coal is fine, though): ` Apr 30 07:43:37 webperf1001 python[5681]: 2019-04-30 07:43:37,515 [...
[07:53:34] !log gilles@deploy1001 Started deploy [performance/navtiming@e900152]: T221848 add more logging around startup
[07:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:38] T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[07:53:39] !log gilles@deploy1001 Finished deploy [performance/navtiming@e900152]: T221848 add more logging around startup (duration: 00m 05s)
[07:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:05] !log gilles@deploy1001 Started deploy [performance/navtiming@8f135ac]: T221848 Defalt to partition 0 when no partition is found
[08:11:05] !log gilles@deploy1001 deploy aborted: T221848 Defalt to partition 0 when no partition is found (duration: 00m 00s)
[08:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:09] T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[08:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:16] !log gilles@deploy1001 Started deploy [performance/navtiming@8f135ac]: T221848 Default to partition 0 when no partition is found
[08:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:21] !log gilles@deploy1001 Finished deploy [performance/navtiming@8f135ac]: T221848 Default to partition 0 when no partition is found (duration: 00m 05s)
[08:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:26] (CR) Elukey: [C: +2] Add the dfs.namenode.handler.count HDFS option [puppet/cdh] - https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[08:17:02] Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) Fixed navtiming for now, I'll investigate further to make sure that this is a proper fix and not a hack. Right now I'm not sure that the metadat...
[08:18:32] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (Pablo-WMDE) @mobrovac During T221755 & T221754 we tended to [[ https://ssr-termbox.wmflabs.org/?spec | `/?spec` ]] and [[ https...
[08:19:04] (PS1) Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:20:30] (PS2) Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:21:47] (PS3) Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:22:36] !log bounce prometheus on bast4002 after backfill has finished - T187987
[08:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:40] T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987
[08:30:10] (PS4) Elukey: Make dfs_namenode_handler_count optional [puppet/cdh] - https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702)
[08:31:37] (CR) Elukey: [C: +2] Make dfs_namenode_handler_count optional [puppet/cdh] - https://gerrit.wikimedia.org/r/507257 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[08:32:21] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[08:34:45] (PS1) Elukey: hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702)
[08:37:50] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16183/" [puppet] - https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[08:45:52] (CR) Joal: "One comment on the value, except from that looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[08:50:27] (CR) Elukey: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[08:51:15] (PS2) Elukey: hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702)
[08:52:10] (PS1) Ema: Revert "package_builder: move lintian out of require_package" [puppet] - https://gerrit.wikimedia.org/r/507262 (https://phabricator.wikimedia.org/T221784)
[08:52:45] (CR) Filippo Giunchedi: [C: +1] prometheus: 10M max-samples for all instances [puppet] - https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105) (owner: CDanis)
[08:55:20] (CR) Filippo Giunchedi: [C: +1] "On deployment: puppet will restart prometheus instances due to this, thus we'll need to do a controlled rollout" [puppet] - https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105) (owner: CDanis)
[08:55:39] (PS8) Fsero: registryha: feat: introducing LVS configuration [puppet] - https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101)
[08:57:07] (CR) Ema: [C: +2] Revert "package_builder: move lintian out of require_package" [puppet] - https://gerrit.wikimedia.org/r/507262 (https://phabricator.wikimedia.org/T221784) (owner: Ema)
[08:57:18] (CR) Elukey: [C: +2] hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[08:57:26] (PS3) Elukey: hadoop: raise dfs.namenode.handler.count from 10 to 80 [puppet] - https://gerrit.wikimedia.org/r/507259 (https://phabricator.wikimedia.org/T220702)
[08:58:49] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:02:26] Operations, ops-eqiad, DBA, Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (Marostegui) The following tables have been checked against multiple hosts and reported no differences: ` archive logging page revision text user change_ta...
[09:02:50] !log roll restart hdfs namenodes on an-master100[1,2] to pick up new settings - T220702
[09:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:54] T220702: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702
[09:04:20] volans, moritzm: I've introduced a dependency cycle to test T221784, puppet failed on boron at 08:58.
Still no trace of alerts in icinga [09:04:21] T221784: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 [09:05:05] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/16185/webperf1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: 10Gilles) [09:05:15] ema: in the middle of something else, I can have a look in a few [09:05:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans) [09:05:28] volans: sure, no rush! [09:09:57] ema: I forced the puppet check and in fact it believes puppet is running fine: "OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures" [09:10:04] but a manual run on boron in fact fails [09:10:33] Puppet even fails to fail properly :-) [09:11:04] (03PS1) 10Marostegui: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) [09:11:45] moritzm: https://twitter.com/HackerNewsOnion/status/1118542182842085376 [09:11:51] (03CR) 10Gehel: [C: 04-1] cookbook API: add class API (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [09:12:16] (03CR) 10Jcrespo: "Structurally this makes way more sense, I am ok with the philosophy, but need to check it doesn't break anything as firewall changes can b" [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [09:15:19] (03PS1) 10Marostegui: db1093: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) [09:15:21] (03CR) 10Marostegui: "jcrespo maybe you want to push this Thursday instead of today given that tomorrow is bank holiday and maybe 
we want to leave this host run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [09:15:23] (03CR) 10Marostegui: "jcrespo maybe you want to push this Thursday instead of today given that tomorrow is bank holiday and maybe we want to leave this host run" [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [09:15:25] (03PS2) 10Ema: Add profile::cache::varnish::frontend::text [puppet] - 10https://gerrit.wikimedia.org/r/507022 (https://phabricator.wikimedia.org/T219967) [09:18:28] (03CR) 10Ema: [C: 03+2] Add profile::cache::varnish::frontend::text [puppet] - 10https://gerrit.wikimedia.org/r/507022 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [09:19:38] (03CR) 10Gehel: "minor style issue, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [09:20:57] (03CR) 10Volans: [C: 03+2] setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans) [09:23:01] (03PS9) 10Fsero: registryha: feat: introducing LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) [09:23:53] 10Operations, 10cloud-services-team: labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) [09:25:39] (03CR) 10Fsero: [C: 03+2] registryha: feat: introducing LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/506367 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [09:26:02] (03Merged) 10jenkins-bot: setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans) [09:26:20] !log creating lvs endpoints for docker registry - T221101 [09:26:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:24] T221101: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 [09:26:39] 10Operations, 10cloud-services-team: labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) Also applies to labpuppetmaster* [09:26:55] (03CR) 10jenkins-bot: setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/506966 (owner: 10Volans) [09:26:58] 10Operations, 10cloud-services-team: labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) [09:31:22] !log fsero@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=docker-registry,service=docker-registry [09:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:18] (03PS1) 10Ema: cache: reimage cp4022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967) [09:33:11] (03CR) 10Gehel: [C: 04-1] "A few more comments inline" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [09:33:46] (03CR) 10Jcrespo: [C: 03+1] "+1 but let's wait a bit." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [09:34:12] (03CR) 10Jcrespo: [C: 03+1] db1093: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [09:34:55] (03CR) 10Marostegui: "Let's merge this once it gets pooled, no need to page if it is not pooled for now" [puppet] - 10https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: 10Marostegui) [09:35:59] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.44:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:36:09] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 47 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal [09:36:16] ^ this is expected [09:36:31] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.44:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:37:09] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.44:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:37:19] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 47 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal [09:39:12] <_joe_> lvs2006 should recover [09:39:54] <_joe_> I see that ip in ipvsadm [09:40:59] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 38 connections established with conf2001.codfw.wmnet:2379 (min=39) https://wikitech.wikimedia.org/wiki/PyBal [09:41:03] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.44:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:42:27] RECOVERY - PyBal IPVS diff check 
on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:43:28] 10Operations, 10monitoring, 10Wikimedia-Incident: prometheus: some sort of IRC alerts on restarts? - https://phabricator.wikimedia.org/T222108 (10fgiunchedi) Checking process uptime sounds good to me, if I understood correctly (the one-time icinga notifcation) the alert would self-recover once uptime is no l... [09:43:49] (03PS2) 10Ema: cache: reimage cp4022 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967) [09:43:52] (03PS1) 10Ema: cache: add ulsfo_ats to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) [09:46:08] 10Operations, 10monitoring, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10fgiunchedi) +1! I'm expecting the most effective mitigation to be recording rules, followed by loading less panels [09:46:37] 10Operations, 10cloud-services-team: labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) Same for labstore1004/1005 [09:46:52] 10Operations, 10cloud-services-team: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10MoritzMuehlenhoff) [09:47:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506400 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [09:51:35] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 39 connections established with conf2001.codfw.wmnet:2379 (min=39) https://wikitech.wikimedia.org/wiki/PyBal [09:51:39] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:51:51] 
RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:52:05] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 48 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal [09:53:57] (03PS2) 10Muehlenhoff: Ignore libpng for nginx service restarts [puppet] - 10https://gerrit.wikimedia.org/r/507020 [09:54:09] (03PS3) 10Muehlenhoff: Ignore libpng for nginx service restarts [puppet] - 10https://gerrit.wikimedia.org/r/507020 [09:56:29] (03CR) 10Muehlenhoff: [C: 03+2] Ignore libpng for nginx service restarts [puppet] - 10https://gerrit.wikimedia.org/r/507020 (owner: 10Muehlenhoff) [09:57:43] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:57:52] (03PS2) 10Muehlenhoff: Move Kerberos Hiera settings to global setting [puppet] - 10https://gerrit.wikimedia.org/r/506647 [09:58:17] PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:58:27] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 48 connections established with conf1004.eqiad.wmnet:4001 (min=48) https://wikitech.wikimedia.org/wiki/PyBal [09:58:56] (03PS1) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504) [09:59:46] (03PS2) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504) [10:00:04] bah I hoped I'd ^c before the first one went. 
oh well [10:03:48] (03CR) 10Muehlenhoff: [C: 03+2] Move Kerberos Hiera settings to global setting [puppet] - 10https://gerrit.wikimedia.org/r/506647 (owner: 10Muehlenhoff) [10:05:01] (03PS14) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [10:05:03] (03PS12) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [10:05:05] (03PS3) 10Jcrespo: mariadb-backups: Setup db2097 as the source of some codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203) [10:05:07] (03PS1) 10Jcrespo: mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572) [10:06:05] (03PS4) 10Jbond: facter3/puppet5: update interface fact parsing [puppet] - 10https://gerrit.wikimedia.org/r/506651 (https://phabricator.wikimedia.org/T219803) [10:07:00] (03PS5) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) [10:08:42] !log stop s7 and x1 instances on dbstore2* for cloning T220572 [10:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:47] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 [10:08:53] (03PS2) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) [10:10:08] (03PS17) 10Mathew.onipe: elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) [10:10:10] (03PS9) 10Mathew.onipe: icinga: create and apply cirrus config check 
[puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [10:11:12] (03PS1) 10Arturo Borrero Gonzalez: labtestvirt2003: now a spare system [puppet] - 10https://gerrit.wikimedia.org/r/507270 (https://phabricator.wikimedia.org/T222057) [10:11:38] (03PS1) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 [10:11:49] (03PS3) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) [10:12:04] !log rollout rsyslog upgrade to 8.1901.0-1~bpo9+wmf1 in eqsin / ulsfo / esams [10:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:11] (03PS2) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 [10:12:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestvirt2003: now a spare system [puppet] - 10https://gerrit.wikimedia.org/r/507270 (https://phabricator.wikimedia.org/T222057) (owner: 10Arturo Borrero Gonzalez) [10:13:47] (03PS2) 10Jcrespo: mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572) [10:14:38] (03PS1) 10Jbond: facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272 [10:14:54] (03PS4) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) [10:14:57] (03CR) 10jerkins-bot: [V: 04-1] facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond) [10:15:05] (03CR) 10Marostegui: [C: 03+1] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [10:15:25] (03CR) 
10Jcrespo: [C: 03+2] mariadb-backups: Productionize db2100/1 for backup source of s7, s8 & x1 [puppet] - 10https://gerrit.wikimedia.org/r/507269 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [10:15:53] (03PS3) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 [10:16:59] (03CR) 10Jbond: "rebuild" [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond) [10:17:20] (03PS2) 10Jbond: facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272 [10:17:39] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [10:17:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 (owner: 10Fsero) [10:18:09] (03CR) 10Fsero: [C: 03+2] registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 (owner: 10Fsero) [10:18:20] (03PS4) 10Fsero: registryha,lvs: bug: bad icinga check for lvs service [puppet] - 10https://gerrit.wikimedia.org/r/507271 [10:19:29] 10Operations, 10serviceops, 10User-Elukey: Renew certs for mcrouter on all application servers. - https://phabricator.wikimedia.org/T221346 (10elukey) Looks sane to me! [10:20:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond) [10:20:14] jynus: is your change good to merge? 
[10:20:17] (03CR) 10Jbond: [C: 03+2] facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272 (owner: 10Jbond) [10:20:25] fsero: yes [10:20:28] I got distracted [10:20:38] merged [10:20:38] (03PS3) 10Jbond: facter3: fix interface check on lvs systems [puppet] - 10https://gerrit.wikimedia.org/r/507272 [10:22:35] (03PS5) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) [10:22:59] (03PS1) 10Arturo Borrero Gonzalez: labtestservices2003: reimage as spare stretch. [puppet] - 10https://gerrit.wikimedia.org/r/507273 (https://phabricator.wikimedia.org/T222060) [10:23:11] (03PS18) 10Mathew.onipe: elasticsearch: config file for aligning puppet config [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) [10:23:13] (03PS10) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [10:23:29] PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2058435 https://wikitech.wikimedia.org/wiki/Varnish [10:24:20] looking ^ [10:24:50] (03PS2) 10Arturo Borrero Gonzalez: labtestservices2003: reimage as spare stretch. [puppet] - 10https://gerrit.wikimedia.org/r/507273 (https://phabricator.wikimedia.org/T222060) [10:26:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestservices2003: reimage as spare stretch. 
[puppet] - 10https://gerrit.wikimedia.org/r/507273 (https://phabricator.wikimedia.org/T222060) (owner: 10Arturo Borrero Gonzalez) [10:27:27] cp3038's scheduled backend restart is in 2 hours and it's not failing yet, waiting for cron to restart it [10:28:47] (03CR) 10Mathew.onipe: elasticsearch: config file for aligning puppet config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [10:32:18] !log T222057 reimaged labtestvirt2003 as spare system [10:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:24] T222057: labtestvirt2003.codfw.wmnet: reimage as spare stretch - https://phabricator.wikimedia.org/T222057 [10:32:43] !log T222060 reimaged labtestservices2003 as stretch spare system [10:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:47] T222060: labtestservices2003.wikimedia.org: reimage as spare stretch - https://phabricator.wikimedia.org/T222060 [10:32:56] !log rollout rsyslog upgrade to 8.1901.0-1~bpo9+wmf1 in codfw [10:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:24] RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational [10:39:28] (03PS2) 10Arturo Borrero Gonzalez: vagrant: refactor roles into profiles [puppet] - 10https://gerrit.wikimedia.org/r/507005 (https://phabricator.wikimedia.org/T221225) [10:41:06] ACKNOWLEDGEMENT - docker-registry LVS codfw on docker-registry.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string \{\} not found on https://docker-registry.svc.codfw.wmnet:443/v2/ - 292 bytes in 0.159 second response time Fsero bad icinga check https://wikitech.wikimedia.org/wiki/Docker-registry-runbook [10:41:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] vagrant: refactor roles into profiles [puppet] - 10https://gerrit.wikimedia.org/r/507005 (https://phabricator.wikimedia.org/T221225) (owner: 
10Arturo Borrero Gonzalez) [10:42:22] (03PS6) 10Jbond: puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) [10:43:19] (03CR) 10Jbond: [C: 03+2] puppet5/facter3: update canary [puppet] - 10https://gerrit.wikimedia.org/r/506652 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:44:20] (03PS5) 10Arturo Borrero Gonzalez: cloudvps: introduce proper base role/profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506979 (https://phabricator.wikimedia.org/T221225) [10:45:37] !log santhosh@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [10:45:38] !log santhosh@deploy1001 scap-helm cxserver cluster staging completed [10:45:38] !log santhosh@deploy1001 scap-helm cxserver finished [10:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16196/" [puppet] - 10https://gerrit.wikimedia.org/r/506979 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [10:48:23] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:48:38] !log santhosh@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [10:48:39] !log santhosh@deploy1001 scap-helm cxserver cluster eqiad completed [10:48:39] !log santhosh@deploy1001 scap-helm cxserver finished [10:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:19] !log santhosh@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [10:49:20] !log santhosh@deploy1001 scap-helm cxserver cluster codfw completed [10:49:20] !log santhosh@deploy1001 scap-helm cxserver finished [10:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:57] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:50:03] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:51:01] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:51:07] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:51:41] (03PS1) 10Volans: prometheus: fix base URL template [software/spicerack] - 10https://gerrit.wikimedia.org/r/507279 [10:51:43] (03PS1) 10Volans: doc: autodoc missing API modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/507280 [10:51:45] (03PS1) 10Volans: doc: add checker to ensure modules are documented [software/spicerack] - 10https://gerrit.wikimedia.org/r/507281 [10:52:38] (03PS1) 10Fsero: registryha,lvs: bug: modifying check LVS [puppet] - 10https://gerrit.wikimedia.org/r/507282 (https://phabricator.wikimedia.org/T221101) [10:53:35] (03CR) 10Jbond: [C: 03+1] "good catch lgtm" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/507279 (owner: 10Volans) [10:53:44] (03CR) 10Fsero: [C: 03+2] registryha,lvs: bug: modifying check LVS [puppet] - 10https://gerrit.wikimedia.org/r/507282 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [10:55:22] !log Updated cxserver to 2019-04-30-055331-production (T219412) [10:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:26] T219412: CX2: Do not translate reference contents - https://phabricator.wikimedia.org/T219412 [10:55:34] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:56:03] (03CR) 10jerkins-bot: [V: 04-1] doc: add checker to ensure modules are documented [software/spicerack] - 10https://gerrit.wikimedia.org/r/507281 (owner: 10Volans) [10:56:34] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:56:37] (03PS6) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) [10:57:14] (03PS4) 10Lucas Werkmeister (WMDE): Serialize empty lists as objects on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) [10:57:16] (03PS5) 10Lucas Werkmeister (WMDE): Serialize empty lists as objects on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104) [10:57:45] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:57:52] (03PS7) 10Ema: cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) [10:58:45] RECOVERY - Request 
latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:59:01] (03PS2) 10Volans: doc: add checker to ensure modules are documented [software/spicerack] - 10https://gerrit.wikimedia.org/r/507281 [10:59:14] (03CR) 10Ema: [C: 03+2] cache: add ATS nodes to cacheproxy::cron_restart [puppet] - 10https://gerrit.wikimedia.org/r/507267 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [11:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1100). [11:00:04] Lucas_WMDE and mlitn: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:15] check [11:01:00] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 (10fsero) [11:01:01] can I start with my patches? 
[11:01:17] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 (10fsero) [11:01:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: 10Lucas Werkmeister (WMDE)) [11:01:43] (03CR) 10Volans: [C: 03+2] prometheus: fix base URL template [software/spicerack] - 10https://gerrit.wikimedia.org/r/507279 (owner: 10Volans) [11:01:51] okay I’m starting [11:02:37] sure [11:02:49] (03CR) 10jerkins-bot: [V: 04-1] Serialize empty lists as objects on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: 10Lucas Werkmeister (WMDE)) [11:02:54] !log cp3038 mbox lag, restarting varnish-be [11:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:34] huh https://integration.wikimedia.org/ci/job/operations-mw-config-typos-docker/4346/console [11:03:42] oh no [11:04:02] not this thing? 
T222131 [11:04:02] T222131: mediawiki-quibble-composertest-php70-docker failure: Unable to find image 'docker-registry.wikimedia.org/releng/castor:0.2.0' locally - https://phabricator.wikimedia.org/T222131 [11:04:38] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] Fix coal syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: 10Gilles) [11:04:49] (03PS4) 10Effie Mouzeli: Fix coal syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/506626 (https://phabricator.wikimedia.org/T221401) (owner: 10Gilles) [11:05:06] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Gilles) I've tracked down the root cause of the issue: https://github.com/dpkp/kafka-python/issues/1774 For other uses of python-kafka we have, we simp... [11:05:22] let’s see if any other config patches got successfully merged recently [11:05:28] (03CR) 10Muehlenhoff: "| tbh i think for notebook and possibly the stat servers it may be easier to exclude everything unless explicitly requested?" 
[puppet] - https://gerrit.wikimedia.org/r/507056 (owner: Jbond)
[11:07:05] (CR) Lucas Werkmeister (WMDE): "let’s retry that" [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:07:07] (Merged) jenkins-bot: prometheus: fix base URL template [software/spicerack] - https://gerrit.wikimedia.org/r/507279 (owner: Volans)
[11:07:09] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:08:02] (CR) jenkins-bot: prometheus: fix base URL template [software/spicerack] - https://gerrit.wikimedia.org/r/507279 (owner: Volans)
[11:08:27] (Merged) jenkins-bot: Serialize empty lists as objects on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:08:31] !log gilles@deploy1001 Started deploy [performance/navtiming@d6756c0]: T221848 Proper fix for partitions_for_topic in python-kafka > 1.4.4
[11:08:37] !log gilles@deploy1001 Finished deploy [performance/navtiming@d6756c0]: T221848 Proper fix for partitions_for_topic in python-kafka > 1.4.4 (duration: 00m 05s)
[11:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:37] T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[11:08:41] (CR) jenkins-bot: Serialize empty lists as objects on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/507031 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:45] okay looks like it worked this time
[11:08:52] (CR) Lucas Werkmeister (WMDE): [C: +1] Allow cross-site requests from mobile domains (1 comment) [mediawiki-config] -
https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:09:24] Lucas_WMDE: your first patch is on mwdebug1002, please test
[11:10:14] working as expected, deploying
[11:12:00] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:507031|Serialize empty lists as objects on Wikidata (T138104)]] (duration: 00m 55s)
[11:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:04] T138104: Do not serialize empty containers (descriptions/aliases/sitelinks) as empty array [] - https://phabricator.wikimedia.org/T138104
[11:12:24] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:13:05] Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles)
[11:13:28] (Merged) jenkins-bot: Serialize empty lists as objects on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:13:58] Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Gilles) @Ottomata all our services are good now, you can go ahead with upgrading EventLogging and Hadoop.
[11:14:01] Lucas_WMDE: your second patch is on mwdebug1002, please test
[11:14:11] RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish
[11:14:31] also working as expected, deploying
[11:15:48] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:507032|Serialize empty lists as objects on Commons (T138104)]] (duration: 00m 54s)
[11:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:56] okay, I’m done
[11:16:04] matthiasmullie: can you deploy your own change?
[11:16:15] yeah sure
[11:16:18] ok
[11:16:23] thanks
[11:16:56] (PS4) Matthias Mullie: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734)
[11:17:06] Operations, Tools, cloud-services-team (Kanban): Rebuild toollabs docker images based on wikimedia-jessie - https://phabricator.wikimedia.org/T219751 (hashar) On a freshly created instance that causes apt to fail and causes the puppet-agent-cronjob to fail: ` Apr 30 06:15:02 integration-slave-docker-...
[11:18:22] (CR) Matthias Mullie: [C: +2] Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:18:24] (PS14) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[11:18:37] Operations, Puppet, Icinga, monitoring, Patch-For-Review: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (ema)
[11:18:56] (CR) jerkins-bot: [V: -1] sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[11:19:20] (Merged) jenkins-bot: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:20:07] (CR) jenkins-bot: Serialize empty lists as objects on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/507032 (https://phabricator.wikimedia.org/T138104) (owner: Lucas Werkmeister (WMDE))
[11:20:09] (CR) jenkins-bot: Allow cross-site requests from mobile domains [mediawiki-config] - https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: Matthias Mullie)
[11:20:26] Operations, Puppet, Packaging, Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (jbond) >>! In T219803#5139829, @MoritzMuehlenhoff wrote: > One thing that will need to be fixed is the detection of HP machines to install 'hp-health' in module...
[11:21:54] (PS15) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[11:21:59] Operations, Puppet: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (jbond) p:Triage→Normal
[11:22:28] !log mlitn@deploy1001 Synchronized wmf-config/CommonSettings.php: Allow cross-site requests from mobile domains (duration: 00m 52s)
[11:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:45] (CR) jerkins-bot: [V: -1] sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[11:24:19] (PS1) Ema: cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291
[11:24:52] (CR) jerkins-bot: [V: -1] cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291 (owner: Ema)
[11:25:12] Operations, Puppet, Icinga, monitoring, Patch-For-Review: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (Volans) After a bit of digging with @ema we found that in this case the `/var/lib/puppet/state/last_run_summary.yaml` file is...
[11:25:55] (PS2) Ema: cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291
[11:26:38] I'm done
[11:26:43] (CR) jerkins-bot: [V: -1] cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291 (owner: Ema)
[11:26:57] me too, and nothing else in the deployment calendar, so
[11:27:00] !log EU SWAT done
[11:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:45] (PS3) Ema: cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291
[11:29:50] (CR) Ema: [C: +2] cache: always define fe_transient_storage [puppet] - https://gerrit.wikimedia.org/r/507291 (owner: Ema)
[11:35:22] Operations, Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (hashar)
[11:36:39] (PS3) Ema: cache: reimage cp4022 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967)
[11:40:43] (PS11) Mathew.onipe: icinga: create and apply cirrus config check [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[11:41:04] (CR) Mathew.onipe: icinga: create and apply cirrus config check (10 comments) [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[11:44:40] (PS1) Elukey: Deprecate GELF logging [puppet/cdh] - https://gerrit.wikimedia.org/r/507297
[11:48:05] (CR) Gilles: "@ema anything else you need me to do on this patch?"
[puppet] - https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: Gilles)
[11:53:23] (PS16) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[11:58:54] PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1200)
[12:00:10] (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803)
[12:00:12] (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 esams [puppet] - https://gerrit.wikimedia.org/r/507300 (https://phabricator.wikimedia.org/T219803)
[12:00:14] (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 ulsfo [puppet] - https://gerrit.wikimedia.org/r/507301 (https://phabricator.wikimedia.org/T219803)
[12:00:18] (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 codfw [puppet] - https://gerrit.wikimedia.org/r/507302 (https://phabricator.wikimedia.org/T219803)
[12:00:20] (PS1) Jbond: facter3/puppet5: enable puppet5/facter3 eqiad [puppet] - https://gerrit.wikimedia.org/r/507303 (https://phabricator.wikimedia.org/T219803)
[12:00:23] (PS1) Jbond: facter3/puppet5: change default versions [puppet] - https://gerrit.wikimedia.org/r/507304
[12:00:25] (PS1) Jbond: facter3/puppet5: clean up old config [puppet] - https://gerrit.wikimedia.org/r/507305 (https://phabricator.wikimedia.org/T219803)
[12:02:37] who should I contact to get https://gerrit.wikimedia.org/r/c/operations/puppet/+/506895 merged?
[12:02:40] (PS15) Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203)
[12:02:52] (PS13) Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203)
[12:02:54] (PS4) Jcrespo: mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203)
[12:03:32] (PS1) Elukey: Enable hdfs-audit log [puppet/cdh] - https://gerrit.wikimedia.org/r/507306
[12:03:34] Operations, Performance-Team, Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (Gilles) p:Normal→Low
[12:03:51] (CR) Jcrespo: [C: -2] "Pending complete setup of db2100." [puppet] - https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[12:07:22] (CR) Joal: "One comment on comment :)" (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/507306 (owner: Elukey)
[12:11:10] (PS1) Gilles: Proxy Thumbor 404s as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071)
[12:11:48] (CR) jerkins-bot: [V: -1] Proxy Thumbor 404s as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: Gilles)
[12:14:37] (PS2) Gilles: Proxy Thumbor 404s as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071)
[12:19:23] (PS2) Elukey: Enable hdfs-audit log [puppet/cdh] - https://gerrit.wikimedia.org/r/507306
[12:21:22] (CR) Joal: [C: +1] "LGTM!"
[puppet/cdh] - https://gerrit.wikimedia.org/r/507306 (owner: Elukey)
[12:25:20] RECOVERY - puppet last run on mw1348 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:26:55] (PS2) CDanis: prometheus: 10M max-samples for all instances [puppet] - https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105)
[12:30:21] (CR) Effie Mouzeli: [C: +1] profile::mediawiki::nutcracker: make memcached configuration optional [puppet] - https://gerrit.wikimedia.org/r/504831 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[12:32:41] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'R:prometheus::server' 'disable-puppet "staged rollout T222105 by cdanis"'
[12:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:46] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[12:32:53] (CR) CDanis: [C: +2] prometheus: 10M max-samples for all instances [puppet] - https://gerrit.wikimedia.org/r/507210 (https://phabricator.wikimedia.org/T222105) (owner: CDanis)
[12:34:52] !log moved /home to /srv/home (more space in a dedicated partition) on stat1005
[12:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:59] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[12:36:46] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[12:37:25] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16201/" [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[12:37:34] (PS17) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225)
[12:39:05] !log cdanis@prometheus1004.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"
[12:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:09] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[12:41:17] !log merging a sudo puppet module change
[12:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:05] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/506750 (owner: Dzahn)
[12:46:14] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:47:01] !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo run-puppet-agent --enable "staged rollout T222105 by cdanis"
[12:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:05] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[12:47:30] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:48:24] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[12:51:31] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/507095 (owner: Alex Monk)
[12:53:18] PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
https://wikitech.wikimedia.org/wiki/Jobrunner
[12:54:38] RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[12:55:12] (CR) Ottomata: [C: +1] ":)" [puppet/cdh] - https://gerrit.wikimedia.org/r/507250 (https://phabricator.wikimedia.org/T220702) (owner: Elukey)
[12:55:41] (CR) Ottomata: [C: +1] Deprecate GELF logging [puppet/cdh] - https://gerrit.wikimedia.org/r/507297 (owner: Elukey)
[12:56:04] (CR) Jbond: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[12:58:36] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:59:33] (CR) Volans: [C: +2] doc: autodoc missing API modules [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[13:00:05] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1300)
[13:02:51] !log OOMed the eqiad ops prometheus @ prometheus1004
[13:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:10] PROBLEM - configured eth on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:03:16] PROBLEM - Check size of conntrack table on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:03:32] PROBLEM - Check whether ferm is active by checking the default input chain on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:03:38] PROBLEM - Disk space on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[13:04:00] PROBLEM - DPKG on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:04:01] (Merged)
jenkins-bot: doc: autodoc missing API modules [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[13:04:06] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:04:20] PROBLEM - dhclient process on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:04:54] (CR) jenkins-bot: doc: autodoc missing API modules [software/spicerack] - https://gerrit.wikimedia.org/r/507280 (owner: Volans)
[13:06:04] (PS1) Ottomata: Add cumin aliasaes for schema* [puppet] - https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556)
[13:07:00] Operations, Analytics, Analytics-Kanban, EventBus, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata) Done, thanks. Also edited https://wikitech.wikimedia.org/wiki/Ganeti#Assign_a_hostname%2FIP with instructions for futur...
[13:07:50] PROBLEM - puppet last run on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:08:03] !log OOMed the eqiad ops prometheus @ prometheus1003
[13:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:08] PROBLEM - Check the NTP synchronisation status of timesyncd on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused
[13:14:36] PROBLEM - puppet last run on mw1321 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[13:15:09] !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo disable-puppet 'cdanis testing original query.max-samples T222105'
[13:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:14] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105
[13:15:48] (PS1) Ema: varnish: add Vagrantfile to run VTC tests [puppet] - https://gerrit.wikimedia.org/r/507316
[13:15:50] (PS1) Arturo Borrero Gonzalez: Revert "sudo: decouple sudo from sudo-ldap" [puppet] - https://gerrit.wikimedia.org/r/507317 (https://phabricator.wikimedia.org/T221225)
[13:16:29] !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo systemctl restart prometheus@ops.service
[13:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:42] RECOVERY - Disk space on prometheus1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[13:17:01] (CR) jerkins-bot: [V: -1] Revert "sudo: decouple sudo from sudo-ldap" [puppet] - https://gerrit.wikimedia.org/r/507317 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[13:17:06] RECOVERY - DPKG on prometheus1004 is OK: All packages OK
[13:17:10] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational
[13:17:24] RECOVERY - dhclient process on prometheus1004 is OK: PROCS OK: 0 processes with command name dhclient
[13:17:34] RECOVERY - configured eth on prometheus1004 is OK: OK - interfaces up
[13:17:38] RECOVERY - Check size of conntrack table on prometheus1004 is OK: OK: nf_conntrack is 3 % full
[13:17:49] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] "Ignoring jenkins-bot, the linter is complainig about stuff that were fixed in my patch attempt that I'm now reverting."
[puppet] - https://gerrit.wikimedia.org/r/507317 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[13:17:56] RECOVERY - Check whether ferm is active by checking the default input chain on prometheus1004 is OK: OK ferm input default policy is set
[13:18:28] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:20:38] (CR) Gilles: [C: +1] "Works perfectly, thanks!" [puppet] - https://gerrit.wikimedia.org/r/507316 (owner: Ema)
[13:20:40] !log reverting sudo puppet module changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/507317
[13:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:17] (CR) Ema: [C: +1] "Looks reasonable to me. Commit message nit." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: Gilles)
[13:23:37] (PS2) Ema: varnish: add Vagrantfile to run VTC tests [puppet] - https://gerrit.wikimedia.org/r/507316
[13:24:18] (CR) Ema: [C: +2] varnish: add Vagrantfile to run VTC tests [puppet] - https://gerrit.wikimedia.org/r/507316 (owner: Ema)
[13:24:23] Operations, Cassandra, Core Platform Team Kanban (Done with CPT), Services (done), User-Eevans: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (MoritzMuehlenhoff) The version of c-foreach-restart as currently deployed on restbase* doesn't seem to us...
[13:24:51] (CR) Jbond: [C: +1] "LGTM added a minor comment" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/507281 (owner: Volans)
[13:25:11] (CR) Arturo Borrero Gonzalez: "This change was reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/507317" [puppet] - https://gerrit.wikimedia.org/r/506435 (https://phabricator.wikimedia.org/T221225) (owner: Arturo Borrero Gonzalez)
[13:28:19] !log depool cp4022 and reimage as upload_ats T219967
[13:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:23] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[13:29:53] !log cdanis@prometheus1004.eqiad.wmnet ~ % sudo systemctl restart prometheus@ops.service
[13:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:03] (PS4) Ema: cache: reimage cp4022 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967)
[13:32:29] (CR) Ema: [C: +2] cache: reimage cp4022 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/507266 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[13:35:04] Operations, Traffic, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4022.ulsfo.wmnet'] ` The log can be...
[13:35:16] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[13:40:04] (CR) Muehlenhoff: [C: +1] facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[13:40:15] Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Ottomata)
[13:40:27] (CR) Muehlenhoff: [C: +1] facter3/puppet5: enable puppet5/facter3 esams [puppet] - https://gerrit.wikimedia.org/r/507300 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[13:40:42] (CR) Muehlenhoff: [C: +1] facter3/puppet5: enable puppet5/facter3 ulsfo [puppet] - https://gerrit.wikimedia.org/r/507301 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[13:40:56] (CR) Muehlenhoff: [C: +1] facter3/puppet5: enable puppet5/facter3 codfw [puppet] - https://gerrit.wikimedia.org/r/507302 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[13:41:01] Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (Ottomata)
[13:41:08] (CR) Muehlenhoff: [C: +1] facter3/puppet5: enable puppet5/facter3 eqiad [puppet] - https://gerrit.wikimedia.org/r/507303 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[13:42:55] (CR) Muehlenhoff: [C: +1] facter3/puppet5: change default versions [puppet] - https://gerrit.wikimedia.org/r/507304 (owner: Jbond)
[13:43:56] (PS3) Gilles: Proxy Thumbor errors as-is [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071)
[13:44:04] (CR) Ottomata: [C: +1] "One nit but +1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[13:44:08] RECOVERY - Check the NTP
synchronisation status of timesyncd on prometheus1004 is OK: OK: synced at Tue 2019-04-30 13:44:07 UTC.
[13:44:09] (CR) Gilles: Proxy Thumbor errors as-is (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: Gilles)
[13:44:40] (CR) Ottomata: [C: +1] kafkatee: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[13:45:48] RECOVERY - puppet last run on mw1321 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:46:48] (CR) Ottomata: [C: +1] Enable hdfs-audit log [puppet/cdh] - https://gerrit.wikimedia.org/r/507306 (owner: Elukey)
[13:48:33] (CR) Jbond: [C: +1] "lgtm" [debs/debdeploy] - https://gerrit.wikimedia.org/r/506171 (owner: Muehlenhoff)
[13:48:53] (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/507307 (https://phabricator.wikimedia.org/T222071) (owner: Gilles)
[13:51:08] Operations, Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (fgiunchedi) Indeed, we've ran into the same problem on {T219764}. tl;dr the solution is to repeat the install: `apt install rsyslog rsyslo...
[13:51:48] (PS6) Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987)
[13:52:25] (PS4) Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987)
[13:52:50] (PS5) Jbond: logstash: add ulog parser to logstash [puppet] - https://gerrit.wikimedia.org/r/506400 (https://phabricator.wikimedia.org/T220987)
[13:55:13] (PS1) BBlack: Revert "Add CNAME-variant langlist template" [dns] - https://gerrit.wikimedia.org/r/507321 (https://phabricator.wikimedia.org/T208263)
[13:55:16] (PS1) Volans: doc: mark Sphinx warnings as error [software/spicerack] - https://gerrit.wikimedia.org/r/507322
[13:55:52] (CR) BBlack: [C: +2] Revert "Add CNAME-variant langlist template" [dns] - https://gerrit.wikimedia.org/r/507321 (https://phabricator.wikimedia.org/T208263) (owner: BBlack)
[13:55:54] (CR) Volans: [C: +2] doc: add checker to ensure modules are documented [software/spicerack] - https://gerrit.wikimedia.org/r/507281 (owner: Volans)
[13:57:28] (PS4) BBlack: wm.org no-op cleanup: no empty left-hand-side [dns] - https://gerrit.wikimedia.org/r/501610
[13:57:28] (PS4) BBlack: wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - https://gerrit.wikimedia.org/r/501611
[13:57:30] (PS4) BBlack: wm.org no-op cleanup: Group on hostnames for meta [dns] - https://gerrit.wikimedia.org/r/501612
[13:57:32] (PS4) BBlack: wm.org no-op cleanup: re-arrange top section [dns] - https://gerrit.wikimedia.org/r/501613
[13:57:44] (PS2) BBlack: wm.org no-op cleanup: move other meta up from end [dns] - https://gerrit.wikimedia.org/r/507093
[13:57:45] (PS4) BBlack: ns[012].wikimedia.org: 1D TTLs [dns] - https://gerrit.wikimedia.org/r/501614
[13:57:47] (PS5) BBlack: wm.org: 1h non-dyna records for foo.dcname entries [dns] -
https://gerrit.wikimedia.org/r/501615
[13:59:11] (Abandoned) Ema: varnishtest: mock VCL configuration [puppet] - https://gerrit.wikimedia.org/r/340511 (owner: Ema)
[14:01:09] (Merged) jenkins-bot: doc: add checker to ensure modules are documented [software/spicerack] - https://gerrit.wikimedia.org/r/507281 (owner: Volans)
[14:02:06] (CR) jenkins-bot: doc: add checker to ensure modules are documented [software/spicerack] - https://gerrit.wikimedia.org/r/507281 (owner: Volans)
[14:04:12] (PS1) Ema: Revert "Revert "package_builder: move lintian out of require_package"" [puppet] - https://gerrit.wikimedia.org/r/507324 (https://phabricator.wikimedia.org/T221784)
[14:05:27] (CR) BBlack: [C: +2] wm.org no-op cleanup: no empty left-hand-side [dns] - https://gerrit.wikimedia.org/r/501610 (owner: BBlack)
[14:05:32] (CR) BBlack: [C: +2] wm.org no-op cleanup: prefer @ to wikimedia.org [dns] - https://gerrit.wikimedia.org/r/501611 (owner: BBlack)
[14:05:35] (CR) BBlack: [C: +2] wm.org no-op cleanup: Group on hostnames for meta [dns] - https://gerrit.wikimedia.org/r/501612 (owner: BBlack)
[14:05:38] (CR) BBlack: [C: +2] wm.org no-op cleanup: re-arrange top section [dns] - https://gerrit.wikimedia.org/r/501613 (owner: BBlack)
[14:05:43] (CR) BBlack: [C: +2] wm.org no-op cleanup: move other meta up from end [dns] - https://gerrit.wikimedia.org/r/507093 (owner: BBlack)
[14:09:41] Operations, Traffic, Goal, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4022.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['cp4022.ulsfo.wmnet'] `
[14:09:56] (CR) BBlack: [C: +2] ns[012].wikimedia.org: 1D TTLs [dns] - https://gerrit.wikimedia.org/r/501614 (owner: BBlack)
[14:11:46] PROBLEM - Long running screen/tmux on ganeti2003 is CRITICAL: CRIT: Long running tmux
process. (user: fsero PID: 3700, 1739799s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:11:53] 10Operations, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox) [14:12:11] 10Operations, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox) [14:12:33] (03CR) 10Volans: icinga: create and apply cirrus config check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [14:13:49] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Support upgrades which introduce changes to binary package names (client side) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506171 (owner: 10Muehlenhoff) [14:14:56] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [14:15:16] !log cdanis@prometheus1003.eqiad.wmnet ~ % sudo enable-puppet 'cdanis testing original query.max-samples T222105' [14:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:21] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105 [14:16:30] 10Operations: Integrate Stretch 9.9 point update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [14:16:33] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans) [14:17:04] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2003*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"' [14:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:24] (03PS1) 10Ottomata: eventschemas::service - gzip minified static css/js files [puppet] - 
10https://gerrit.wikimedia.org/r/507326 (https://phabricator.wikimedia.org/T219552) [14:18:34] (03CR) 10Volans: [C: 03+2] doc: mark Sphinx warnings as error [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans) [14:19:12] looks like we had more artificial prometheus dropouts on eqiad DNS data in the past couple of hours [14:19:15] https://grafana.wikimedia.org/d/000000341/dns?orgId=1&from=now-3h&to=now [14:19:16] (03PS2) 10Ottomata: eventschemas::service - gzip minified static css/js files [puppet] - 10https://gerrit.wikimedia.org/r/507326 (https://phabricator.wikimedia.org/T219552) [14:20:05] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans) [14:20:28] (03PS1) 10Muehlenhoff: Drop trusty from debdeploy config [puppet] - 10https://gerrit.wikimedia.org/r/507327 [14:20:44] (03CR) 10Ottomata: [C: 03+2] eventschemas::service - gzip minified static css/js files [puppet] - 10https://gerrit.wikimedia.org/r/507326 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:21:58] 10Operations, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (10Mathew.onipe) @Lucas_Werkmeister_WMDE Thanks! will keep that mind. Your reviews will be welcome when I submit a patch too. [14:22:23] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (10Mathew.onipe) p:05Triage→03Normal [14:22:37] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10CDanis) I'm pretty sure it is these panels that are responsible for the most Prometheus load {F28868610} They take much longer to load than the res... 
[14:22:38] (03Merged) 10jenkins-bot: doc: mark Sphinx warnings as error [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans) [14:23:31] (03CR) 10jenkins-bot: doc: mark Sphinx warnings as error [software/spicerack] - 10https://gerrit.wikimedia.org/r/507322 (owner: 10Volans) [14:23:49] (03PS3) 10Volans: cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 [14:24:30] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'prometheus2004*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"' [14:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:34] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105 [14:26:33] !log disable-puppet "T220987: global kafaka log shipping - staged rollout (jbond)" [14:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:39] T220987: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 [14:28:30] (03CR) 10BBlack: [C: 03+2] wm.org: 1h non-dyna records for foo.dcname entries [dns] - 10https://gerrit.wikimedia.org/r/501615 (owner: 10BBlack) [14:31:03] (03CR) 10Jbond: [C: 03+2] kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [14:31:23] (03PS7) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) [14:31:27] (03CR) 10Jbond: [V: 03+2 C: 03+2] kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [14:32:26] (03PS1) 10BBlack: CAA: Add LE to issuewild for policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/507330 
[14:32:57] (03CR) 10BBlack: [C: 03+2] CAA: Add LE to issuewild for policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/507330 (owner: 10BBlack) [14:33:51] (03CR) 10Muehlenhoff: [C: 03+2] "I've updated the check_conntrack run book a bit." [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [14:34:17] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16212/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507306 (owner: 10Elukey) [14:34:19] (03CR) 10Elukey: [C: 03+2] Deprecate GELF logging [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507297 (owner: 10Elukey) [14:34:31] (03CR) 10Elukey: [V: 03+2 C: 03+2] Enable hdfs-audit log [puppet/cdh] - 10https://gerrit.wikimedia.org/r/507306 (owner: 10Elukey) [14:35:31] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:36:17] (03PS1) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333 [14:38:51] (03CR) 10Volans: [C: 03+2] cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans) [14:39:40] (03PS6) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:41:17] (03CR) 10jerkins-bot: [V: 04-1] Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:42:18] (03PS7) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:43:25] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'bast5001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"' [14:43:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:29] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105 [14:43:59] (03CR) 10jerkins-bot: [V: 04-1] Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:44:04] (03Merged) 10jenkins-bot: cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans) [14:44:15] !log Sending 1% of anonymous users to PHP7.2 - T219150 [14:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [14:44:35] (03PS2) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702) [14:44:59] (03CR) 10jenkins-bot: cookbook API: drop get_title() support [software/spicerack] - 10https://gerrit.wikimedia.org/r/506955 (owner: 10Volans) [14:46:33] (03CR) 10Effie Mouzeli: [C: 03+2] Send 1% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506953 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [14:48:37] (03PS1) 10BBlack: wm.org: add IN where missing on DYNAs [dns] - 10https://gerrit.wikimedia.org/r/507337 [14:49:18] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'labmon1001*' 'run-puppet-agent --enable "staged rollout T222105 by cdanis"' [14:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:23] T222105: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105 [14:49:51] (03CR) 10Muehlenhoff: "I like the approach, some comments inline" (038 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 (owner: 
10Jbond) [14:50:21] (03PS1) 10Ottomata: eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552) [14:51:29] (03CR) 10BBlack: [C: 03+2] wm.org: add IN where missing on DYNAs [dns] - 10https://gerrit.wikimedia.org/r/507337 (owner: 10BBlack) [14:52:04] (03PS1) 10Volans: monitoring: detect Puppet dependency cycle failure [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784) [14:53:59] (03CR) 10Ottomata: [C: 03+2] eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:54:45] (03PS1) 10Andrew Bogott: wmcs: Remove puppet code for the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293) [14:54:48] (03CR) 10Ema: [C: 03+1] "Looks great, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784) (owner: 10Volans) [14:55:05] (03PS2) 10Ottomata: eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552) [14:55:41] (03CR) 10Ottomata: [C: 03+2] eventschemas::service - use relative path to ./repositories [puppet] - 10https://gerrit.wikimedia.org/r/507338 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:55:48] (03CR) 10Andrew Bogott: "There's no rush to merge this but I'm running it through the puppet compiler to see what we've missed." 
[puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293) (owner: 10Andrew Bogott) [14:56:01] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'bast3002*' 'run-puppet-agent --enable "filippo prometheus"' [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:29] (03PS3) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702) [14:56:42] (03PS1) 10Ottomata: eventschemas::service - Don't render config.js.erb, it is in files/ now [puppet] - 10https://gerrit.wikimedia.org/r/507341 (https://phabricator.wikimedia.org/T219552) [14:57:17] (03CR) 10Ottomata: [C: 03+2] eventschemas::service - Don't render config.js.erb, it is in files/ now [puppet] - 10https://gerrit.wikimedia.org/r/507341 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [14:58:42] !log enable-puppet "T220987: global kafaka log shipping - staged rollout (jbond)" [14:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] T220987: Ferm: send ferm/iptables/ulogd logs to Kafaka/logstash/elasticsearch - https://phabricator.wikimedia.org/T220987 [15:00:50] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16213/" [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702) (owner: 10Elukey) [15:00:52] (03PS4) 10Elukey: Update cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/507333 (https://phabricator.wikimedia.org/T220702) [15:01:59] (03PS3) 10Giuseppe Lavagetto: Send 1% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506953 (https://phabricator.wikimedia.org/T219150) [15:03:57] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add fake ssh keys for netbox network user [labs/private] - 10https://gerrit.wikimedia.org/r/507231 (owner: 10Ayounsi) [15:06:23] (03CR) 10Jbond: [C: 03+2] kafka 
shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [15:06:34] (03PS5) 10Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) [15:08:05] <_joe_> jijiki: are you deploying? [15:08:09] yes [15:08:11] 10Operations, 10observability, 10Patch-For-Review, 10Wikimedia-Incident: prometheus: current query limits are insufficient to prevent OOMs - https://phabricator.wikimedia.org/T222105 (10CDanis) 05Open→03Resolved a:03CDanis As documented in T222112#5147131 this didn't actually fix the dashboard at fau... [15:08:13] about to [15:08:13] <_joe_> ok [15:08:21] <_joe_> yeah let's please move on [15:08:38] <_joe_> jenkins made us waste 20 minutes [15:09:23] (03CR) 10jenkins-bot: Send 1% of anonymous users to PHP7.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506953 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [15:09:58] !log jiji@deploy1001 Synchronized wmf-config/CommonSettings.php: Send 1% of anonymous users to PHP7.2 - T219150 (duration: 00m 54s) [15:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:03] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [15:11:00] (03PS2) 10Volans: monitoring: detect Puppet dependency cycle failure [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784) [15:12:06] (03CR) 10Volans: [C: 03+2] monitoring: detect Puppet dependency cycle failure [puppet] - 10https://gerrit.wikimedia.org/r/507339 (https://phabricator.wikimedia.org/T221784) (owner: 10Volans) [15:13:54] (03PS1) 10Herron: exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345 [15:16:00] (03CR) 10Paladox: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/507345 (owner: 10Herron) [15:17:25] (03PS2) 10Ema: Revert "Revert "package_builder: move lintian out of require_package"" [puppet] - 10https://gerrit.wikimedia.org/r/507324 (https://phabricator.wikimedia.org/T221784) [15:18:19] (03CR) 10Ema: [C: 03+2] Revert "Revert "package_builder: move lintian out of require_package"" [puppet] - 10https://gerrit.wikimedia.org/r/507324 (https://phabricator.wikimedia.org/T221784) (owner: 10Ema) [15:18:27] (03PS2) 10Herron: exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345 [15:18:29] (03PS1) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [15:18:39] !log stop s8 instance on dbstore2001 for cloning to db2100 T220572 [15:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:43] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572 [15:20:44] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (10cwdent) [15:20:48] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Network setup for frmon2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T221475 (10cwdent) 05Open→03Resolved @Papaul thanks! 
Working :) [15:20:52] (03PS3) 10Herron: exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345 [15:21:40] (03CR) 10Herron: [C: 03+2] exim-minimal: increase localhost exim max connections [puppet] - 10https://gerrit.wikimedia.org/r/507345 (owner: 10Herron) [15:24:44] (03PS2) 10Herron: remove granularity key from wiki-mail DKIM [dns] - 10https://gerrit.wikimedia.org/r/504948 (https://phabricator.wikimedia.org/T221290) (owner: 10Cwhite) [15:26:46] (03PS8) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [15:28:04] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - 10https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:28:12] (03PS2) 10Jbond: facter3/puppet5: enable puppet5/facter3 eqsin [puppet] - 10https://gerrit.wikimedia.org/r/507299 (https://phabricator.wikimedia.org/T219803) [15:31:33] 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [15:31:48] 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [15:32:59] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[rsyslog-kafka] [15:33:03] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [15:33:23] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [15:33:39] (03PS6) 10Paladox: scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) [15:35:22] looking at labstore1003 now [15:41:30] (03PS1) 10Jbond: kafka: disable kafka on trusty [puppet] - 10https://gerrit.wikimedia.org/r/507350 [15:41:44] (03PS1) 10BryanDavis: Revert "striker: Disable developer account creation" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) [15:41:59] (03CR) 10jerkins-bot: [V: 04-1] Revert "striker: Disable developer account creation" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) (owner: 10BryanDavis) [15:43:26] (03CR) 10Jbond: [C: 03+2] kafka: disable kafka on trusty [puppet] - 10https://gerrit.wikimedia.org/r/507350 (owner: 10Jbond) [15:43:57] (03PS2) 10BryanDavis: Revert "striker: Disable developer account creation" [puppet] - 10https://gerrit.wikimedia.org/r/507351 (https://phabricator.wikimedia.org/T219830) [15:45:36] (03CR) 10CRusnov: "Rambles inline." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [15:45:36] !log restart hadoop hdfs namenodes on an-master100[1,2] to pick up new logging settings - T220702 [15:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:43] T220702: Check if HDFS offers a way to prevent/limit/throttle users to overwhelm the HDFS Namenode - https://phabricator.wikimedia.org/T220702 [15:50:09] (03CR) 10Alex Monk: "I think that apache user was actually in use on precise." [puppet] - 10https://gerrit.wikimedia.org/r/506750 (owner: 10Dzahn) [15:50:40] 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) Labmon hosts are in active/standby pairs for graphite whereas prometheus runs on both independently but only labmon1001 is used for queries via `pro... [15:50:41] (03CR) 10CRusnov: [C: 04-1] "Looks good, minor nit." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi) [15:51:09] (03CR) 10Alex Monk: "T78076 was related" [puppet] - 10https://gerrit.wikimedia.org/r/506750 (owner: 10Dzahn) [15:54:48] 10Operations, 10decommission: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857 (10RobH) p:05Triage→03Normal [15:55:09] (03PS1) 10Ottomata: Include profile::standard and base::firewall in role::eventschemas::service [puppet] - 10https://gerrit.wikimedia.org/r/507353 (https://phabricator.wikimedia.org/T219556) [15:55:18] 10Operations, 10decommission: Decommission labservices1001, 1002 - https://phabricator.wikimedia.org/T221857 (10RobH) [15:55:48] 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH) [15:56:10] (03PS2) 10Ottomata: Include profile::standard and base::firewall in role::eventschemas::service [puppet] - 10https://gerrit.wikimedia.org/r/507353 (https://phabricator.wikimedia.org/T219556) [15:57:09] (03CR) 10Ottomata: [C: 03+2] Include profile::standard and base::firewall in role::eventschemas::service [puppet] - 10https://gerrit.wikimedia.org/r/507353 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [15:57:15] (03CR) 10Paladox: [C: 03+1] wikitech: Provision gerrit api auth credentials [puppet] - 10https://gerrit.wikimedia.org/r/506588 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [15:57:23] 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH) network info: labservices1001 : asw2-d-eqiad:ge-3/0/9 labservices1002 : asw2-a-eqiad:ge-4/0/12 [15:58:32] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [15:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:35] PROBLEM - Check systemd state on ms-be2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:58:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:43] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [15:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:47] 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labservices1001.wikimedia.org` - labservices1001.wikimedia.org - Removed from Puppet master... [15:58:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:55] 10Operations, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `labservices1002.wikimedia.org` - labservices1002.wikimedia.org - Removed from Puppet master... [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:16] (03PS1) 10RobH: decom labservices100[12] references [puppet] - 10https://gerrit.wikimedia.org/r/507354 (https://phabricator.wikimedia.org/T221857) [16:01:31] (03CR) 10RobH: [C: 03+2] decom labservices100[12] references [puppet] - 10https://gerrit.wikimedia.org/r/507354 (https://phabricator.wikimedia.org/T221857) (owner: 10RobH) [16:01:39] (03PS2) 10RobH: decom labservices100[12] references [puppet] - 10https://gerrit.wikimedia.org/r/507354 (https://phabricator.wikimedia.org/T221857) [16:03:15] (03PS2) 10RobH: decom labnet100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/506574 (https://phabricator.wikimedia.org/T221818) [16:03:54] (03CR) 10RobH: [C: 03+2] decom labnet100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/506574 (https://phabricator.wikimedia.org/T221818) (owner: 10RobH) [16:04:50] !log pool cp4022 w/ ATS backend T219967 [16:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:54] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [16:06:30] (03PS1) 10RobH: decom labservices100[12] prod dns [dns] - 10https://gerrit.wikimedia.org/r/507356 (https://phabricator.wikimedia.org/T221857) [16:06:58] (03CR) 10RobH: [C: 03+2] decom labservices100[12] prod dns [dns] - 10https://gerrit.wikimedia.org/r/507356 (https://phabricator.wikimedia.org/T221857) (owner: 10RobH) [16:08:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:08:13] 10Operations, 10Performance-Team, 10Thumbor, 10Traffic, 10Patch-For-Review: SwiftMedia URL rewrite returns some 404s with wrong Content-Length - https://phabricator.wikimedia.org/T222071 (10fgiunchedi) I'm wondering if the underlying issue here (copying responses inside `rewrite.py`) could be the 
culprit... [16:08:37] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10thcipriani) [16:09:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:09:34] (03CR) 10Volans: [C: 04-1] "Minor comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi) [16:09:37] (03PS1) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357 [16:09:54] 10Operations, 10decommission, 10Patch-For-Review: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH) [16:10:16] 10Operations, 10ops-eqiad, 10decommission: Decommission labservices1001 & labservices1002 - https://phabricator.wikimedia.org/T221857 (10RobH) a:05RobH→03Cmjohnson [16:11:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500413 (owner: 10Muehlenhoff) [16:12:05] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (10thcipriani) [16:12:08] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10thcipriani) [16:12:27] 10Operations, 10observability, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10aborrero) >>! In T187987#5147514, @fgiunchedi wrote: > Labmon hosts are in active/standby pairs for graphite whereas prometheus runs on both independently but o... 
[16:12:40] 10Operations, 10Gerrit, 10cloud-services-team, 10serviceops: Change /r/p/ to /r/ on all hosts (where https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (10Paladox) [16:12:47] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10thcipria... [16:12:51] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10thcipriani) [16:14:23] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10thcipriani) [16:16:12] 10Operations, 10ORES, 10Release Pipeline, 10Scoring-platform-team, and 2 others: Execution of the deployment pipeline should be configurable via .pipeline/config.yaml - https://phabricator.wikimedia.org/T210267 (10thcipriani) [16:18:29] (03PS1) 10Ema: cache: add hiera setting for varnish backend restarts [puppet] - 10https://gerrit.wikimedia.org/r/507358 (https://phabricator.wikimedia.org/T219967) [16:18:47] (03PS1) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 [16:19:25] (03CR) 10jerkins-bot: [V: 04-1] profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov) [16:19:46] you need to specify it's a dir [16:19:54] 10Operations, 10Mail: Gmail - Multiple destination domains per transaction is unsupported. Please try again. 
- https://phabricator.wikimedia.org/T222198 (10herron) p:05Triage→03Normal [16:19:56] ops wrong chan [16:20:46] (03PS2) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 [16:26:08] (03PS19) 10Volans: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [16:26:41] (03PS3) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 [16:26:52] 10Operations, 10observability, 10Wikimedia-Incident: prometheus: upgrade to 2.9.2 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) +1 to testing/PoC 2.9.2; we're using Debian Prometheus packages mostly verbatim, but adding back the k8s discovery + dependencies back as they are not shipped in Debian... [16:26:55] !log upgrade puppet and facter in eqsin [16:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:07] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:27:26] (03CR) 10Ema: [C: 03+2] cache: add hiera setting for varnish backend restarts [puppet] - 10https://gerrit.wikimedia.org/r/507358 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [16:27:40] ottomata: missing eventschemas_codfw in icinga config ^^^ [16:27:52] !log upgrade librenms to 1.51 [16:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:17] looking [16:28:29] ottomata: it's the usual hieradata/common/monitoring.yaml [16:28:41] not sure i know this 'usual' :p [16:29:14] any new value of "$cluster_$dc" must be defined there [16:29:34] or icinga complains [16:29:43] hm.... 
[16:30:16] 10Operations, 10Core Platform Team Kanban (Blocked Externally), 10Services (blocked), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10fgiunchedi) >>! In T178839#5144895, @mobrovac wrote: > @Eevans @fgiunchedi is there a plan to resume this work or s... [16:30:18] cluster groups are defined as $cluster_$dc [16:31:19] (03CR) 10jerkins-bot: [V: 04-1] Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [16:32:52] (03PS1) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) [16:33:56] ottomata: that's not for the svc. stuff, but for the hosts [16:34:01] the schemaNNNN hosts [16:34:08] they have [16:34:08] hostgroups eventschemas_eqiad [16:34:11] in icinga config [16:34:30] just FYI, the comment seem a bit confusing [16:34:44] (03PS1) 10Herron: mx: disable multi_domain in smtp transports [puppet] - 10https://gerrit.wikimedia.org/r/507365 (https://phabricator.wikimedia.org/T222198) [16:36:20] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10fgiunchedi) FYI the upgrade seems to be generating cronspam, in the form of facter warnings: `lines=5 Subject: Cron /usr/local/sbin/smart-data-du... 
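Note on the hostgroup exchange above: host cluster groups are named `$cluster_$dc`, and every such name has to be defined in hieradata/common/monitoring.yaml or the Icinga configuration check fails. A minimal sketch of that rule follows, assuming a plain set of already-defined group names; the function names, the `defined` set, and the sample data are hypothetical illustrations, not the actual Puppet or Icinga code.

```python
def expected_hostgroups(cluster, datacenters):
    """Build the $cluster_$dc hostgroup names a cluster needs, one per datacenter."""
    return [f"{cluster}_{dc}" for dc in datacenters]


def missing_hostgroups(cluster, datacenters, defined_groups):
    """Return the expected hostgroup names that are not in defined_groups."""
    return [
        group
        for group in expected_hostgroups(cluster, datacenters)
        if group not in defined_groups
    ]


# Hypothetical state: only the eqiad group has been defined so far.
defined = {"eventschemas_eqiad"}
print(missing_hostgroups("eventschemas", ["eqiad", "codfw"], defined))
# → ['eventschemas_codfw']
```

Under these assumptions the sketch reports the same gap mentioned in the channel (`eventschemas_codfw`); adding that name to the defined set makes the list empty.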
[16:36:43] !log ayounsi@deploy1001 Started deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207706 [16:36:45] (03PS2) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) [16:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:47] T207706: LibreNMS upgrade to 1.49 - https://phabricator.wikimedia.org/T207706 [16:36:52] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207706 (duration: 00m 11s) [16:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:07] (03PS20) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [16:38:56] (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1001/16222/cumin1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov) [16:39:07] RECOVERY - Check systemd state on ms-be2014 is OK: OK - running: The system is fully operational [16:40:13] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:40:15] (03CR) 10Gehel: [C: 04-1] "minor comments inline, otherwise LGTM" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [16:42:37] PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 286 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [16:42:40] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506378 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [16:42:51] (03PS5) 10Jcrespo: mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203) [16:42:56] (03CR) 10jerkins-bot: [V: 04-1] Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [16:44:50] (03PS6) 10Jcrespo: mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203) [16:46:15] ACKNOWLEDGEMENT - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 286 bytes in 0.010 second response time Ayounsi Upgrading to 1.51 https://wikitech.wikimedia.org/wiki/LibreNMS [16:46:31] (03PS2) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357 [16:46:37] (03CR) 10CRusnov: "Some initial comments inline. 
Basically we need to get our crap together configuration-wise :)" (032 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507217 (owner: 10Ayounsi) [16:47:54] (03PS3) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357 [16:47:57] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup new backup source hosts for codfw backups [puppet] - 10https://gerrit.wikimedia.org/r/506947 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [16:49:01] (03PS4) 10Arturo Borrero Gonzalez: base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357 [16:50:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] base: make RAID optional [puppet] - 10https://gerrit.wikimedia.org/r/507357 (owner: 10Arturo Borrero Gonzalez) [16:50:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC is OK: https://puppet-compiler.wmflabs.org/compiler1002/16221/ even though it didn't finish because the compilation server run out of " [puppet] - 10https://gerrit.wikimedia.org/r/507357 (owner: 10Arturo Borrero Gonzalez) [16:51:23] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov) [16:51:46] (03PS4) 10CRusnov: profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 [16:51:56] (03CR) 10Gehel: [C: 04-1] "some more issues" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [16:52:25] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [16:52:26] !log merging change to `profile::base` and `::raid` https://gerrit.wikimedia.org/r/c/operations/puppet/+/507357 related to T221225 [16:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:31] T221225: sssd integration needs to be updated to include 
sudo config from LDAP support - https://phabricator.wikimedia.org/T221225 [16:52:46] (03CR) 10CRusnov: [C: 03+2] profile spicerack: Add Ganeti module configuration [puppet] - 10https://gerrit.wikimedia.org/r/507359 (owner: 10CRusnov) [16:54:09] (03CR) 10Fsero: [C: 04-1] "LGTM overall, but please change the metric-config configmap name to include wmf.releasename also" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [16:55:55] (03PS11) 10MacFan4000: Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) [16:56:07] (03PS21) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [16:57:36] !log ayounsi@deploy1001 Started deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS [16:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:41] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS (duration: 00m 09s) [16:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:19] RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8737 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [16:58:56] (03CR) 10BryanDavis: wikitech: Disable Gerrit accounts when blocked on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [16:59:50] (03PS1) 10CRusnov: Add emacs ignores to gitignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/507370 [17:00:05] cscott, arlolra, subbu, and halfak: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1700). 
[17:00:22] All of them [17:02:03] (03CR) 10Paladox: [C: 03+1] "What if the user is unblocked? Do we want users to create tasks on phabricator to ask to be unblocked from gerrit?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [17:03:20] (03CR) 10Reedy: "Their phab account is probably disabled/blocked too. See the code above" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [17:06:04] (03CR) 10Paladox: [C: 03+1] "> Their phab account is probably disabled/blocked too. See the code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [17:06:06] (03PS1) 10Jcrespo: mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373 [17:06:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373 (owner: 10Jcrespo) [17:07:13] (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [17:07:37] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [17:08:27] (03PS2) 10Jcrespo: mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373 [17:09:33] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the quick fix! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/507373 (owner: 10Jcrespo) [17:10:25] (03PS3) 10Jcrespo: mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373 [17:11:24] (03CR) 10Jcrespo: [C: 03+2] mariadb: Hide diffs on files that contain passwords for puppet [puppet] - 10https://gerrit.wikimedia.org/r/507373 (owner: 10Jcrespo) [17:11:48] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@1f09e44]: Update mobileapps to 142ba30 (T217837) [17:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:55] T217837: [BUG] mobile-html article body has wrong background color - https://phabricator.wikimedia.org/T217837 [17:13:31] (03Abandoned) 10Lucas Werkmeister (WMDE): Enable suggestion constraint status on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504389 (https://phabricator.wikimedia.org/T221107) (owner: 10Lucas Werkmeister (WMDE)) [17:15:47] herron: you around? [17:16:04] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@1f09e44]: Update mobileapps to 142ba30 (T217837) (duration: 04m 16s) [17:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:18] arturo: hey [17:16:56] herron: PCC jenkins worker nodes are out of disk space. Do you know how to handle that, or if simply rm -rf the output/ dir? [17:17:36] arturo: sure I can do some cleanup [17:17:48] just created a task about this yesterday as well [17:17:53] herron: cool thanks! [17:19:21] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210 (10fsero) [17:20:21] ottomata: is icinga fixed? I got sidetracked [17:21:26] arturo: should be good to go now [17:21:34] volans: sorry was in meeting [17:22:00] herron: thanks!! [17:22:06] np! [17:22:07] volans: aren't all the comments misleading then? 
[17:22:19] oh i see what you mean [17:23:12] (03PS3) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) [17:24:22] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [17:25:24] (03CR) 10Ottomata: [C: 03+2] Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [17:25:31] (03PS4) 10Ottomata: Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) [17:25:33] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add eventschemas clusters in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/507364 (https://phabricator.wikimedia.org/T219552) (owner: 10Ottomata) [17:32:22] (03PS1) 10CRusnov: profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384 [17:33:10] (03CR) 10jerkins-bot: [V: 04-1] profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384 (owner: 10CRusnov) [17:33:37] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:03] ottomata: you didn't run puppet on icinga right? 
[17:34:31] no [17:34:33] runnign now [17:38:55] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [17:39:05] \o/ thanks [17:41:08] (03PS1) 10CDanis: swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386 [17:41:20] (03CR) 10Gergő Tisza: [C: 03+1] wikitech: Disable Gerrit accounts when blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506587 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [17:41:56] (03CR) 10jerkins-bot: [V: 04-1] swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386 (owner: 10CDanis) [17:43:59] (03PS2) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [17:45:29] (03PS2) 10CDanis: swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386 [17:47:35] (03PS3) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [17:50:01] (03PS2) 10CRusnov: profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384 [17:51:32] (03PS4) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [17:52:26] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [17:53:27] (03PS1) 10CRusnov: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 [17:53:43] (03PS5) 10Arturo Borrero Gonzalez: Revert "Revert 
"sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [17:54:08] (03CR) 10CDanis: "PCC looks fine https://puppet-compiler.wmflabs.org/compiler1002/16238/" [puppet] - 10https://gerrit.wikimedia.org/r/507386 (owner: 10CDanis) [17:54:35] (03PS1) 10Ayounsi: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) [17:56:48] (03PS2) 10Ayounsi: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) [17:57:20] (03CR) 10Dzahn: [C: 03+1] "thanks for adding them! lgtm afaict. Volans is expert though" [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [17:58:46] (03CR) 10Volans: "I'm not familiar with Ganeti RAPI and how much time different calls take, but I'm not sure we need all this boilerplate and overhead to ju" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [17:58:55] (03CR) 10Dzahn: [C: 03+2] "perfect, thanks Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [17:59:02] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16242/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [17:59:07] (03PS6) 10Dzahn: base::firewall: add runbooks for check_ferm and check_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/506550 (https://phabricator.wikimedia.org/T197873) [17:59:29] (03PS3) 10CRusnov: profile::ganeti: Add cumin hosts to RAPI and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/507384 [17:59:33] (03PS6) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [17:59:47] (03CR) 
10Volans: [C: 03+1] "Syntax wise they are correct." [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1800) [18:01:18] (03CR) 10CRusnov: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [18:02:31] (03PS7) 10Arturo Borrero Gonzalez: Revert "Revert "sudo: decouple sudo from sudo-ldap"" [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [18:03:00] jouncebot: now [18:03:00] For the next 0 hour(s) and 56 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1800) [18:03:02] jouncebot: next [18:03:02] In 0 hour(s) and 56 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1900) [18:04:31] RECOVERY - puppet last run on kubestagetcd1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:04:43] Reedy: You wanting to do something? 
[18:04:59] James_F: The same thing I do every night [18:05:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:06:19] (03CR) 10CRusnov: "PCC output" [puppet] - 10https://gerrit.wikimedia.org/r/507384 (owner: 10CRusnov) [18:08:34] (03PS3) 10CDanis: swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386 [18:09:18] !log start branchcut for 1.34.0-wmf.3 [18:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:28] (03PS8) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [18:10:18] (03CR) 10CDanis: [C: 03+2] swift: hiera-ize object-replicator concurrency [puppet] - 10https://gerrit.wikimedia.org/r/507386 (owner: 10CDanis) [18:10:36] (03CR) 10Dzahn: kafka: add icinga notes URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:11:05] (03PS3) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) [18:11:07] (03PS1) 10CRusnov: Add more emacs things to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/507393 [18:11:17] (03CR) 10jerkins-bot: [V: 04-1] kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:12:05] RECOVERY - Long running screen/tmux on ganeti2003 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:12:25] 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.49 - https://phabricator.wikimedia.org/T207706 (10ayounsi) Added doc on how to upgrade LibreNMS https://wikitech.wikimedia.org/wiki/LibreNMS#Upgrade_LibreNMS [18:12:39] 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10ayounsi) [18:14:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, maybe also designate a canary?" [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [18:14:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:14:20] (03PS1) 10CDanis: swift: codfw: bump replicate concurrency for decomm hosts [puppet] - 10https://gerrit.wikimedia.org/r/507396 (https://phabricator.wikimedia.org/T221068) [18:14:59] 18:11:13 remote: aborting due to possible repository corruption on the remote side. [18:15:02] 18:11:13 fatal: protocol error: bad pack header [18:15:10] hrmm.... CI [18:15:26] hmm [18:15:28] are you sure it was't out of disk space on the local side? [18:15:41] erf [18:15:57] (03CR) 10Muehlenhoff: [C: 03+1] "One nit, but looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:16:03] hmm. no. let me try it again [18:16:15] (03PS4) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) [18:16:27] (03CR) 10Jbond: "Thanks for this i have been thinking of similar things in the back of my mind. will give a proper review tomorrow. 
My initial comment is " (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503947 (owner: 10Giuseppe Lavagetto) [18:17:42] (03PS2) 10Rush: wikitech: Provision gerrit api auth credentials [puppet] - 10https://gerrit.wikimedia.org/r/506588 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [18:18:40] yea, could not reproduce. works again [18:18:42] (03CR) 10CDanis: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/16245/" [puppet] - 10https://gerrit.wikimedia.org/r/507396 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis) [18:18:50] (03CR) 10Rush: [C: 03+2] wikitech: Provision gerrit api auth credentials [puppet] - 10https://gerrit.wikimedia.org/r/506588 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [18:19:39] (03CR) 10Dzahn: [C: 03+2] kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [18:19:49] (03PS5) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) [18:20:41] (03PS2) 10CDanis: swift: codfw: bump replicate concurrency for decomm hosts [puppet] - 10https://gerrit.wikimedia.org/r/507396 (https://phabricator.wikimedia.org/T221068) [18:20:43] (03CR) 10Ayounsi: Netbox: add j2nb support (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi) [18:20:44] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:21:49] (03PS3) 10Ayounsi: Netbox: add j2nb support [puppet] - 10https://gerrit.wikimedia.org/r/507224 [18:22:01] cdanis: you are first in line to merge [18:22:05] I'm merging now [18:22:10] puppet-merge'ing rather [18:22:14] submits [18:22:25] rebases :) [18:22:28] (03PS6) 10Dzahn: kafka: add icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/506719 (https://phabricator.wikimedia.org/T197873) [18:22:40] I think I bumped into bd.808 as well [18:22:40] anyone know what's happening with mediawiki/extensions/JADE being renamed? currently breaking branch-cut since it wasn't updated in tools-release repo. [18:23:04] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:23:41] thcipriani it's now at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/Jade [18:24:08] !log running puppet on ms-be2014 to bump replication concurrency T221068 [18:24:09] is this in use? 
I don't know what's going to happen when we update localization [18:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:12] T221068: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 [18:24:14] cdanis: ^^ could that be related to your puppet-merge https://phabricator.wikimedia.org/T221529#5143984 [18:24:21] thcipriani the other repo is archived [18:24:24] (read only) [18:24:33] aka https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/JADE [18:24:54] (03PS2) 10Dzahn: transparency report: allow members of LDAP 'nda' to see private site [puppet] - 10https://gerrit.wikimedia.org/r/506848 (https://phabricator.wikimedia.org/T221744) [18:24:56] thcipriani: Looks like neither repo has any merges recnetly [18:25:19] And 1.34.0-wmf.1 used JADE not Jade anyway [18:25:34] Some documentation being all that got missed out [18:25:35] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Jade/+/5221a7a8ccf31a27870034de20073789a828b44e%5E%21/ [18:25:54] RECOVERY - Long running screen/tmux on notebook1003 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:26:27] Was it only recently marked readonly though? 
[18:26:49] i think so [18:26:54] there's a task for this [18:27:01] yeah https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/JADE/+/refs/meta/config%5E%21/#F0 [18:27:26] https://phabricator.wikimedia.org/T221437 [18:27:55] (03CR) 10Dzahn: [C: 03+2] transparency report: allow members of LDAP 'nda' to see private site [puppet] - 10https://gerrit.wikimedia.org/r/506848 (https://phabricator.wikimedia.org/T221744) (owner: 10Dzahn) [18:27:57] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16246/netmon1002.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507224 (owner: 10Ayounsi) [18:29:41] I am going to unarchive JADE for the time being, make a new task about how to roll this out via train because I don't feel like it was done properly and I worry about the consequences to l10n. [18:29:43] (03PS3) 10Ayounsi: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) [18:29:47] (03Abandoned) 10Ammarpad: Increase default thumb size to 260px on Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490395 (https://phabricator.wikimedia.org/T215106) (owner: 10Ammarpad) [18:30:35] (03CR) 10Ayounsi: "Thanks." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:31:04] jbond42: the catalog fetch fails? possibly [18:31:14] jbond42: I get the sense there's been several merges in a row today [18:32:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Dzahn) p:05Triage→03Normal [18:35:21] cdanis: sorry i thought you missed my comment so i didn't follow up [18:35:36] in this case i think it was just a coincidence, that box is having problems [18:35:44] oh both the labweb ones? 
[18:37:07] (03CR) 10Ottomata: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [18:37:13] i just checked 1002, was just waiting to see if another patch was coming in, figured someone was working on it [18:37:14] (03PS2) 10Ottomata: Add cumin aliasaes for schema* [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) [18:37:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add cumin aliasaes for schema* [puppet] - 10https://gerrit.wikimedia.org/r/507312 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [18:38:26] (03CR) 10Dzahn: [C: 03+1] "compared to the way PHP7.2 is installed for phabricator on stretch and looks all the same" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:39:12] (03CR) 10Dzahn: [C: 03+1] "> Is there anything else that needs to be done for the app uses PHP 7.2 or it will pickup the proper version automatically?" 
[puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:39:21] cdanis: cloud admin are looking into it [18:40:12] (03CR) 10Ottomata: Refactor eventgate-analytics to eventgate (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [18:40:50] !log running puppet on ms-be201[3,5] to bump replication concurrency T221068 [18:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:55] T221068: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 [18:41:28] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:42:29] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/16247/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:43:58] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:48:18] (03CR) 10Dzahn: [V: 03+1 C: 03+2] Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:48:28] (03PS4) 10Dzahn: Add PHP 7.2 to LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:48:58] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:49:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:52:11] (03CR) 
10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:55:58] (03PS7) 10Ottomata: Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) [18:57:59] (03PS8) 10Ottomata: Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) [18:58:33] (03CR) 10Ottomata: "Ok, instead of renaming and modifying eventgate-analytics, I've left eventgate-analytics in place until we are done migrating to the new e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [18:59:09] (03PS1) 10BBlack: Convert most DYNA into CNAME to wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/507399 (https://phabricator.wikimedia.org/T208263) [18:59:11] (03PS1) 10BBlack: Change CNAME->DYNA TTLs from 1H to 1D [dns] - 10https://gerrit.wikimedia.org/r/507400 (https://phabricator.wikimedia.org/T208263) [19:00:04] thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T1900). 
[19:00:14] * thcipriani works on it [19:02:06] (03PS9) 10Ottomata: Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) [19:02:50] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Refactor eventgate-analytics to eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [19:03:31] thcipriani: would love to test some things, let me know when you deploy to test wiki [19:03:39] ottomata: will do [19:03:42] danke [19:03:49] might be a few though, running a bit behind :( [19:10:18] (03CR) 10Andrew Bogott: "The compiler found one false positive, but things look good:" [puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293) (owner: 10Andrew Bogott) [19:10:46] s'ok [19:12:28] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:15:43] (03PS1) 10Dzahn: netmon: switch from PHP 7.0 to PHP 7.2 for LibreNMS upgrade [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706) [19:19:26] (03PS2) 10Dzahn: netmon: switch from PHP 7.0 to PHP 7.2 for LibreNMS upgrade [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706) [19:21:17] (03PS1) 10Ottomata: eventgate-analytics - Add cirrussearch-request to stream-config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/507403 (https://phabricator.wikimedia.org/T214080) [19:21:30] (03CR) 10Dzahn: "> Is there anything else that needs to be done for the app uses PHP 7.2 or it will pickup the proper version automatically?" 
[puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [19:22:01] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Add cirrussearch-request to stream-config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/507403 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [19:22:31] (03CR) 10Dzahn: [C: 03+2] netmon: switch from PHP 7.0 to PHP 7.2 for LibreNMS upgrade [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706) (owner: 10Dzahn) [19:23:29] (03CR) 10Dzahn: "You see this makes it easy to revert and go back to 7.0 while we don't have to revert your large patch and can have both packages at the s" [puppet] - 10https://gerrit.wikimedia.org/r/507402 (https://phabricator.wikimedia.org/T207706) (owner: 10Dzahn) [19:24:34] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/analytics/eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:45] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:34] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics/eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:36] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [19:25:36] !log otto@deploy1001 scap-helm eventgate-analytics finished [19:25:38] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:48] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [19:26:49] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [19:26:49] !log otto@deploy1001 scap-helm eventgate-analytics finished [19:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:53] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics/eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [19:27:55] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [19:27:55] !log otto@deploy1001 scap-helm eventgate-analytics finished [19:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:37] (03PS1) 10Thcipriani: Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405 [19:31:41] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:34:03] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:34:35] looks like we've got a lot of old branches to clean up :\ [19:34:52] oh? 
[19:35:36] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@4360316]: Redeploy GUI for fixes T222133, T222129, T222181, T222182 [19:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:45] T222181: “Edit visually” broken on embed.htm - https://phabricator.wikimedia.org/T222181 [19:35:45] T222182: Short URL doesn’t work on embed.html - https://phabricator.wikimedia.org/T222182 [19:35:45] T222129: WDQS link back to WDQS from a rendered result doesn't show the SPARQL used to create the report - https://phabricator.wikimedia.org/T222129 [19:35:46] T222133: "Edit SPARQL" link is broken in embed.html - https://phabricator.wikimedia.org/T222133 [19:36:40] yeah, scap clean was broken for a bit so we couldn't cleanup old branches as part of train for a while [19:37:54] now we have branches going back to February all taking up the space of a MW checkout + extensions + l10n :) [19:38:04] cleaning now [19:38:05] oh my [19:38:07] k [19:38:10] (brb btw) [19:38:11] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:40:11] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.17 (duration: 10m 11s) [19:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:11] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Eevans) >>! In T211721#5145739, @aaron wrote: > > [ ... ] > > ... My understanding is... [19:43:18] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Eevans) >>! In T211721#5145107, @EvanProdromou wrote: > On a related note, do we want or... 
[19:43:29] !log switched netmon1002/netmon2001 from PHP 7.0 to 7.2 but reverted because LibreNMS still had an issue with it [19:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:20] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.18 (duration: 02m 25s) [19:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:44:53] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@4360316]: Redeploy GUI for fixes T222133, T222129, T222181, T222182 (duration: 09m 17s) [19:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:01] T222181: “Edit visually” broken on embed.htm - https://phabricator.wikimedia.org/T222181 [19:45:01] T222182: Short URL doesn’t work on embed.html - https://phabricator.wikimedia.org/T222182 [19:45:02] T222129: WDQS link back to WDQS from a rendered result doesn't show the SPARQL used to create the report - https://phabricator.wikimedia.org/T222129 [19:45:02] T222133: "Edit SPARQL" link is broken in embed.html - https://phabricator.wikimedia.org/T222133 [19:47:35] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.19 (duration: 02m 24s) [19:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:51:19] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [19:52:02] 10Operations, 10ops-codfw, 10DBA, 10Goal, 
10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [19:53:24] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10Papaul) [19:53:48] baack [19:55:11] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db2098-db2101 [puppet] - 10https://gerrit.wikimedia.org/r/507407 (https://phabricator.wikimedia.org/T220572) [19:56:29] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.20 (duration: 02m 07s) [19:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:08] !log thcipriani@deploy1001 Started scap: testwiki to 1.34.0-wmf.3 and rebuild l10n cache [19:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:56] (03PS1) 10Ottomata: eventgate - include error_stream in default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/507409 (https://phabricator.wikimedia.org/T218346) [19:59:30] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission osm-db200[12] and osm-web200[1234] - https://phabricator.wikimedia.org/T187445 (10Papaul) [19:59:34] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - include error_stream in default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/507409 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [20:03:19] (03PS1) 10Ottomata: eventgate - properly indent stream_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/507411 (https://phabricator.wikimedia.org/T218346) [20:10:25] (03PS2) 10Ottomata: eventgate - properly indent stream_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/507411 (https://phabricator.wikimedia.org/T218346) [20:11:10] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - properly indent stream_config [deployment-charts] - 10https://gerrit.wikimedia.org/r/507411 (https://phabricator.wikimedia.org/T218346) (owner: 
10Ottomata) [20:21:30] !log netmon1002 - loading PHP 7.2 module to debug issue for librenms. librenms very short downtime [20:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:37] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 77.23, 34.66, 21.62 [20:25:51] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 52.05, 22.07, 14.45 [20:26:15] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 64.70, 35.23, 22.35 [20:26:16] cdb rebuild step causing ^ ? maybe? [20:26:39] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 51.10, 27.14, 18.63 [20:27:03] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 60.65, 29.06, 19.43 [20:27:07] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 60.90, 29.39, 19.31 [20:27:11] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 22.56, 19.89, 14.32 [20:27:11] hrm, nope, at least not for 1285, just hhvm angry [20:27:13] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 67.60, 32.98, 20.61 [20:27:15] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 23.00, 30.96, 22.53 [20:27:15] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 61.21, 28.27, 18.72 [20:27:33] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 27.35, 30.22, 21.60 [20:27:53] almost the whole appserver fleet had a big burst of network traffic [20:27:55] in eqiad [20:27:59] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 24.36, 24.16, 18.25 [20:28:19] and a somewhat-lagged burst of CPU load [20:28:19] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 31.82, 27.75, 19.77 [20:28:25] RECOVERY - High 
CPU load on API appserver on mw1290 is OK: OK - load average: 28.27, 27.19, 19.43 [20:28:28] bounced back pretty quick [20:28:33] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 29.37, 29.19, 20.31 [20:28:35] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 26.44, 25.08, 18.39 [20:29:26] !log thcipriani@deploy1001 Finished scap: testwiki to 1.34.0-wmf.3 and rebuild l10n cache (duration: 31m 17s) [20:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:35] lots of requests taking longer too [20:29:48] https://grafana.wikimedia.org/d/000000580/apache-backend-timing?orgId=1&from=now-3h&to=now [20:30:10] hm [20:30:20] blip, and then dropping down again [20:30:31] the cpu / latency blip is not over [20:30:47] oh that's the % over 0.5s graph yeah [20:30:58] the network blip does not correlate with thcipriani's deploy [20:31:22] should I pause train while you all investigate? Or keep going with group0? currently new version is on testwiki only. [20:31:23] did a world event just happen? :) [20:31:32] chaomodus: frontend traffic graphs are not elevated [20:31:43] ah [20:31:58] thcipriani: testwiki seems to be very slow too? [20:32:01] nor is qps observed at apache, so it isn't a difference in cache-ability at the CDN layer [20:32:29] ottomata: that's normal, takes a while for hhvm bytecode cache to warm up [20:33:15] the delay from the deploy is really odd IMO [20:33:45] "a while" == "12 hits per server" or something like that, been a while since I looked into it [20:34:05] FWIW, I started seeing these when the cdb rebuild part of deployment started [20:35:27] thcipriani: forgive my ignorance; what are the biggest users of CDB inside WMF wikis? [20:35:38] probably the localization messages...? 
[20:35:46] cdanis: those are all the l10n messages for the wiki, yeah [20:35:54] "the wiki" == all wikis [20:36:06] I guess interwiki links get a lot of hits but that db must be much much smaller [20:37:25] I feel like this is not dissimilar to https://phabricator.wikimedia.org/T204871 [20:37:50] just the other side of that error [20:38:10] cdanis: Interwiki links? they're not cdb... [20:38:37] Reedy: that's not what https://www.mediawiki.org/wiki/CDB suggested, but that's all I know [20:38:43] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/interwiki.php [20:38:50] !bug 1 | cdanis [20:38:50] cdanis: https://bugzilla.wikimedia.org/show_bug.cgi?id=1 [20:39:00] They've been php for 3 years now [20:39:01] lol [20:39:04] lol [20:39:17] I guess, the page isn't technically incorrect [20:39:22] I think, it can use cdb, but we don't [20:39:47] I don't know MW codebase well enough to code-spelunk quickly [20:40:07] * Interwiki cache, either as an associative array or a path to a constant [20:40:07] * database (.cdb) file. [20:40:20] ahh, amusingly, https://www.mediawiki.org/wiki/Interwiki_cache gets it right [20:40:23] Until 2015, Wikimedia used it to configure the path to a CDB file that is loaded from disk when needed. Since 2016, it is also supported to set $wgInterwikiCache directly to an array. 
This is typically done by storing the array in a PHP file containing https://github.com/wikimedia/operations-mediawiki-config/commit/5bc3b88a0488e96b7473c7ceeb815b78ea5e9bb9#diff-be231341e8f4ecc1a4106d690593dac6 [20:42:10] interesting, logstash does not show an increase in fatals for the time interval in question here (20:23-20:30 was the worst of the latency spike) [20:42:57] (03PS3) 10Gehel: Enable revision fetches in production [puppet] - 10https://gerrit.wikimedia.org/r/504990 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev) [20:43:46] I'm not sure if that increase in network traffic beforehand is usual for a deploy, either [20:43:53] (03CR) 10Gehel: [C: 03+2] Enable revision fetches in production [puppet] - 10https://gerrit.wikimedia.org/r/504990 (https://phabricator.wikimedia.org/T217897) (owner: 10Smalyshev) [20:44:17] but ah well, site seems fine, things look normal-ish now (appserver cluster is maybe using a hair more RAM than before, but not to an alarming degree) [20:45:06] huh, we didn't get a 60 second timeout thing, that's...bizarre for a recent deployment? maybe? I guess it's been a few weeks since I did one. [20:45:12] we did! but it was before that [20:45:19] oh [20:45:25] 19:40-19:45 or so [20:46:15] (03CR) 10Dzahn: "When switching to 7.2 in prod librenms would fail to work. After some debugging (had to turn on error_reporting etc to see what caused the" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [20:46:24] er, 19:30-ish for the start [20:46:26] ah [20:46:30] which matches up better with your deploy [20:46:58] pruning wikiversions does sync some code: https://tools.wmflabs.org/sal/log/AWpvwwDJOwpQ-3Pku3J5 [20:47:03] rsync --delete [20:47:25] oh and that will also clear the HHVM bytecode cache? 
[20:47:42] unsure [20:47:52] it hasn't in the past afaicr [20:48:21] for old enough versions of MW, i can't imagine the hhvm bytecode cache still references them [20:48:30] should mostly be stat-ing a bunch of files and the files its removing shouldn't be in the bytecode ca...yeah ^ [20:48:52] yeah [20:48:54] hm [20:48:56] strange. [20:49:36] (03PS1) 10Dzahn: librenms: ensure php7.2-ldap is installed [puppet] - 10https://gerrit.wikimedia.org/r/507487 (https://phabricator.wikimedia.org/T207706) [20:50:19] (03PS3) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504) [20:50:25] the differences in network traffic patterns on the appservers a bit is odd to me [20:50:46] (03PS2) 10Dzahn: librenms: ensure php7.2-ldap is installed [puppet] - 10https://gerrit.wikimedia.org/r/507487 (https://phabricator.wikimedia.org/T207706) [20:50:52] how so? [20:50:56] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&from=now-1h&to=now and flop open 'network per host' [20:51:20] most of them rx'd more than they tx'd, but a few of them had a big spike of tx [20:51:21] (03CR) 10Dzahn: [C: 03+2] librenms: ensure php7.2-ldap is installed [puppet] - 10https://gerrit.wikimedia.org/r/507487 (https://phabricator.wikimedia.org/T207706) (owner: 10Dzahn) [20:52:11] like mw1251 [20:52:26] like a proportionally huge spike [20:53:14] like all of them show the same pattern except for two with the tx spikes [20:53:16] neat [20:53:18] that is weird [20:53:37] three -- mw1251, mw1268, mw1320 [20:53:51] ah missed that last one [20:53:52] which started at approx the same time, 20:15 [20:57:31] it's funny how much larger those spikes are than the event rx spikes [20:57:41] so looking at the logstash scap dashboard merge-cdb-updates finished up close to the time ofthe end of the spike on 1251: Updated 417 
CDB files(s) in /srv/mediawiki/php-1.34.0-wmf.3/cache/l10n 2019-04-30T20:26:18 [20:59:05] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:00:27] (03PS1) 10Clarakosi: Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) [21:03:38] !log librenms - switched from PHP 7.0 to PHP 7.2 succesful now. reverted manual changes for debugging on netmon1002 [21:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:52] !log netmon2001 - apt-get remove --purge php7.0* [21:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:19] (03CR) 10Ppchelko: [C: 03+1] Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi) [21:05:44] cdanis: chaomodus still digging? can I go ahead with 1.34.0-wmf.3 to group0? 
[21:05:55] oh, no, sorry, I think it's fine [21:05:58] proceed [21:06:06] Yah it seems okay from my perspective [21:06:07] !log netmon2001 - apt-get install php-common php-pear (pending upgrades) [21:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:10] thanks :) [21:06:11] just was playing with graphs a bit :) [21:06:12] there's some strange stuff here but nothing alarming [21:06:19] agreed [21:06:27] and it's not like I actually know anything about the appservers or scap in the first place 🙃 [21:07:47] I'm not seeing how these things could correlate [21:07:51] but yah same [21:08:11] (03CR) 10Thcipriani: [C: 03+2] Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405 (owner: 10Thcipriani) [21:09:24] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405 (owner: 10Thcipriani) [21:10:28] !log netmon1002 - apt-get remove --purge php 7.0* ; apt-get install php-common php-pear (pending upgrades) | netmon2001: apt autoremove [21:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:59] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/507487 working now! also i removed the 7.0 packages and cleaned up" [puppet] - 10https://gerrit.wikimedia.org/r/507391 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [21:13:33] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.3 [21:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:03] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10EvanProdromou) >>! In T211721#5148423, @Eevans wrote: >>>! In T211721#5145107, @EvanProd... [21:15:07] ottomata: FYI, 1.34.0-wmf.3 live on group0 [21:15:27] thank you! 
[21:15:33] its working [21:24:45] 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10Dzahn) - [[ https://gerrit.wikimedia.org/r/507391 | first change ]] installed the 7.2 packages but did not change which Apache module was loaded, so still used 7.0 while both packages we... [21:24:59] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507405 (owner: 10Thcipriani) [21:25:22] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Andrew) *bump* I still need something like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ in order to get cloudvirt1024 online (and to pave the way towards... [21:31:41] 10Operations, 10ops-codfw: wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10Dzahn) p:05Triage→03Normal [21:32:40] 10Operations, 10Domains, 10Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10Dzahn) p:05Triage→03Normal [21:33:30] 10Operations, 10observability, 10Wikimedia-Incident: prometheus: upgrade to 2.9.2 - https://phabricator.wikimedia.org/T222113 (10Dzahn) p:05Triage→03Normal [21:34:21] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational [21:37:39] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10Dzahn) p:05Triage→03High [21:38:03] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10Dzahn) p:05Triage→03High [21:38:33] 10Operations, 10observability, 10Wikimedia-Incident: prometheus: some sort of IRC alerts on restarts? 
- https://phabricator.wikimedia.org/T222108 (10Dzahn) p:05Triage→03Normal [21:40:27] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Epic: [Epic] Scaling strategy for Wikidata Query Service - https://phabricator.wikimedia.org/T221938 (10Dzahn) p:05Triage→03Normal [21:40:41] 10Operations, 10observability, 10Wikimedia-Incident: prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) - https://phabricator.wikimedia.org/T222102 (10Dzahn) p:05Triage→03Normal [21:41:00] 10Operations, 10cloud-services-team: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10Dzahn) p:05Triage→03High [21:42:32] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 5 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10EvanProdromou) >>! In T211721#5009838, @aaron wrote: > I think 10x of a normal SET in a... [21:43:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:44:06] !log Deployed patch for T222036 (1.34.0-wmf.1 and 1.34.0-wmf.3) [21:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:19] !log Deployed patch for T222038 (1.34.0-wmf.1 and 1.34.0-wmf.3) [21:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:49:42] (03PS2) 10Andrew Bogott: wmcs: Remove puppet code for the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/507340 (https://phabricator.wikimedia.org/T167293) [21:49:44] (03PS1) 10Andrew Bogott: OpenStack: Update firewall defines to remove references to things in ::main:: [puppet] - 10https://gerrit.wikimedia.org/r/507505 [21:49:46] (03PS1) 10Andrew Bogott: labtest/codfw-dev: remove some dangling references to the main region [puppet] - 10https://gerrit.wikimedia.org/r/507506 [21:49:48] (03PS1) 10Andrew Bogott: wmcs: update or remove some old references to the main region [puppet] - 10https://gerrit.wikimedia.org/r/507507 [21:49:50] (03PS1) 10Andrew Bogott: prometheus: update references to the no-longer-existing 'main' deploy [puppet] - 10https://gerrit.wikimedia.org/r/507508 [21:49:52] (03PS1) 10Andrew Bogott: wmcs: remove hiera references to the now-deleted main deploy [puppet] - 10https://gerrit.wikimedia.org/r/507509 [21:51:48] (03PS1) 10CRusnov: Update requirements and artifacts for Netbox v2.5.11 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507510 [21:52:58] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b3b140f] (dev-cluster): Parsoid: use the new stashing tables for old revisions too [21:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:20] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b3b140f] (dev-cluster): Parsoid: use the new stashing tables for old revisions too (duration: 03m 22s) [21:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:31] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b3b140f]: Parsoid: Use the new stash tables for old revisions - T215956 [21:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:36] T215956: Consider stashing data-parsoid for VE - 
https://phabricator.wikimedia.org/T215956 [22:04:20] (03CR) 10Mobrovac: [C: 04-1] "LGTM, one comment in-lined." (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi) [22:21:28] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b3b140f]: Parsoid: Use the new stash tables for old revisions - T215956 (duration: 23m 56s) [22:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:32] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [22:47:48] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) We had a hangout and chat and talked about the remaining things and agreed they are currently not needed. If Willy is blocked by anything we will revisit. [22:53:17] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) 05Open→03Resolved [22:56:12] (03CR) 10CRusnov: "I have tested this on af-netbox." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507510 (owner: 10CRusnov) [23:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190430T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:59] PROBLEM - Check systemd state on ms-be2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:07:28] !log ayounsi@deploy1001 Started deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481 [23:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:33] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481 (duration: 00m 05s) [23:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:10] (03PS2) 10Paladox: Merge tag 'v2.15.13' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/505801 [23:13:22] (03PS1) 10Paladox: Update plugins to 2.15.13 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/507521 [23:14:46] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10mobrovac) >>! In T220402#5146064, @Pablo-WMDE wrote: > @mobrovac During T221755 & T221754 we tended to [[ https://ssr-termbox.w... 
[23:16:11] (03PS1) 10Dzahn: admins: add Joel Aufrecht to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/507522 (https://phabricator.wikimedia.org/T222214) [23:16:21] (03PS4) 10ArielGlenn: split up page content jobs with max bytes per page range [dumps] - 10https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504) [23:16:24] (03PS2) 10Paladox: Update plugins to 2.15.13 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/507521 [23:17:23] PROBLEM - LibreNMS HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 287 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [23:18:08] !log ayounsi@deploy1001 Started deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS [23:18:10] (03PS2) 10Dzahn: admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) [23:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:13] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS (duration: 00m 05s) [23:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:41] RECOVERY - LibreNMS HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8737 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [23:19:57] (03PS1) 10Paladox: Remove quota plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/507523 [23:22:53] 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10ayounsi) I tried to deploy it once again. 1/ They replaced log_file with log_dir, this will need a puppet change I temporarily worked around it but: 2/ App is not loading and this is s... 
[23:30:11] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:30:11] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:30:25] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:31:03] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:31:19] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:31:19] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[23:34:19] (CR) ArielGlenn: [C: +2] split up page content jobs with max bytes per page range [dumps] - https://gerrit.wikimedia.org/r/507268 (https://phabricator.wikimedia.org/T221504) (owner: ArielGlenn)
[23:35:03] RECOVERY - Check systemd state on ms-be2015 is OK: OK - running: The system is fully operational
[23:35:40] !log ariel@deploy1001 Started deploy [dumps/dumps@d715ea0]: determine page ranges of content output files by cumul revision length as well as rev count
[23:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:43] !log ariel@deploy1001 Finished deploy [dumps/dumps@d715ea0]: determine page ranges of content output files by cumul revision length as well as rev count (duration: 00m 03s)
[23:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:51] deploy and sleep, that's me
[23:36:52] it's exciting to see gerrit's metrics showing improvements since yesterday! (threads are lower than they have been for the whole month!)
[23:37:07] (but the job won't start until tomorrow 11 am my time so it's all good)
[23:37:38] (PS1) Papaul: DNS: Remoce mgmt and production DNS for db2014,db2020,db2021,db2022,db2024,db2031 [dns] - https://gerrit.wikimedia.org/r/507525
[23:43:30] (PS1) Ayounsi: LibreNMS, add log dir [puppet] - https://gerrit.wikimedia.org/r/507526 (https://phabricator.wikimedia.org/T207706)
[23:46:30] (CR) Dzahn: [C: +1] LibreNMS, add log dir [puppet] - https://gerrit.wikimedia.org/r/507526 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[23:47:46] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/16251/netmon1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/507526 (https://phabricator.wikimedia.org/T207706) (owner: Ayounsi)
[23:48:59] paladox: very nice :)
[23:49:06] bbiaw
[23:49:06] yup :)
[23:49:49] !log ayounsi@deploy1001 Started deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481
[23:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:54] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@2094575]: Upgrade LibreNMS to 1.51 - T207481 (duration: 00m 04s)
[23:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:39] !log ayounsi@deploy1001 Started deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS
[23:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:44] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@0fd8da6]: Rollback LibreNMS (duration: 00m 05s)
[23:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:57] i keep getting this error every time i try to open wikipedia from google
[23:58:04] Request from 2601:14a:c201:2d54:a42a:38c9:9ec7:44e1 via cp1085 cp1085, Varnish XID 866091559 Error: 503, Backend fetch failed at Tue, 30 Apr 2019 23:57:35 GMT
[23:58:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:58:49] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:58:49] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:58:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:58:59] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[23:59:05] i guess that must be it ^^
[23:59:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:59:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[23:59:43] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[23:59:47] mutante: XioNoX ^