[00:30:12] (03PS1) 10EBernhardson: Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) [00:30:57] (03CR) 10jerkins-bot: [V: 04-1] Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) (owner: 10EBernhardson) [00:33:56] (03PS2) 10EBernhardson: Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) [00:34:41] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [00:47:15] (03CR) 10Ayounsi: [C: 032] LibreNMS: IRC alerts on -operations [puppet] - 10https://gerrit.wikimedia.org/r/419731 (owner: 10Ayounsi) [00:47:25] (03PS2) 10Ayounsi: LibreNMS: IRC alerts on -operations [puppet] - 10https://gerrit.wikimedia.org/r/419731 [01:00:20] !log reedy@tin Synchronized php-1.31.0-wmf.25/includes/specials/pagers/NewFilesPager.php: Fix T189846 (duration: 00m 58s) [01:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:25] T189846: PHP Fatal Error: Invalid operand type was used: cannot perform this operation with arrays - https://phabricator.wikimedia.org/T189846 [01:18:01] (03CR) 10Krinkle: navtiming.py: Make sure to record country specific when oversampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [01:21:23] (03CR) 10Krinkle: navtiming.py: Make sure to record country specific when oversampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [01:22:30] (03CR) 10Krinkle: navtiming.py: Make sure to record country specific when oversampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [01:22:39] (03CR) 10Krinkle: [C: 04-1] navtiming.py: Make sure to 
record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [01:30:19] 10Operations, 10Discovery-Search: Additional network ports for elasticsearch servers? - https://phabricator.wikimedia.org/T189854#4055439 (10EBernhardson) [01:46:34] !log librenms IRC bot moved to -operations channel. Doc on how to turn it off is on https://wikitech.wikimedia.org/wiki/LibreNMS#IRC_Alerting [01:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:12] (03CR) 10BryanDavis: [C: 031] toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [02:04:44] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4055495 (10ayounsi) [02:06:12] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4055496 (10ayounsi) [02:06:45] (03PS6) 10Ahmed123: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) [02:07:06] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3857747 (10ayounsi) [02:09:06] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4055498 (10ayounsi) [02:11:57] (03CR) 10Bmansurov: "I have no immediate plans to deploy." 
[puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) (owner: 10Vgutierrez) [02:13:36] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4055506 (10ayounsi) @Cmjohnson those two VC ports show up as down, could you please look at the cabling? ``` ayounsi@asw2-b-eqiad> show virtual-chassis v... [02:15:11] (03CR) 10jenkins-bot: Enable ShortUrl Extension at knwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417762 (https://phabricator.wikimedia.org/T189287) (owner: 10Jayprakash12345) [02:15:46] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4055513 (10ayounsi) @Cmjohnson some ports show up as absent, some as up but with no neighbors, could you please have a look? ``` ayounsi@asw2-c-eqiad> show virtual-chassis vc-port |... [02:16:12] (03CR) 10jenkins-bot: labs: Disable reading from term_search_key from wb_terms table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419859 (https://phabricator.wikimedia.org/T189776) (owner: 10Ladsgroup) [02:17:33] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419735 (owner: 10Marostegui) [02:18:52] (03CR) 10jenkins-bot: Depool rdb1007 for kernel security update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419719 (owner: 10Muehlenhoff) [02:21:08] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419724 (owner: 10Marostegui) [02:21:58] (03CR) 10jenkins-bot: Revert "Depool rdb1007 for kernel security update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419726 (owner: 10Muehlenhoff) [02:22:39] (03CR) 10jenkins-bot: group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419826 (owner: 10Chad) [02:23:50] (03CR) 10jenkins-bot: Enable ping from 
edit summary in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417329 (https://phabricator.wikimedia.org/T188469) (owner: 10MaxSem) [02:27:48] (03CR) 10jenkins-bot: robots.txt: Combine various NS_SPECIAL disallows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411255 (owner: 10Chad) [02:29:13] (03CR) 10jenkins-bot: Revert "Depool rdb1005 for kernel security update" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419716 (owner: 10Muehlenhoff) [02:31:17] (03CR) 10jenkins-bot: New throttle rule, clean expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418875 (https://phabricator.wikimedia.org/T189442) (owner: 10Urbanecm) [02:32:19] (03CR) 10jenkins-bot: Change autoconfirmed settings and Enable flood group at zhwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418008 (https://phabricator.wikimedia.org/T189289) (owner: 10Rxy) [02:35:21] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. 
[02:38:39] (03CR) 10jenkins-bot: db-codfw.php: Depool db2050 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419768 (owner: 10Marostegui) [02:39:44] (03PS1) 10Ayounsi: Smokeping: remove asw- mgmt probing [puppet] - 10https://gerrit.wikimedia.org/r/419966 [02:39:46] (03CR) 10jenkins-bot: Undeploy the disabled ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419492 (https://phabricator.wikimedia.org/T186570) (owner: 10MaxSem) [02:43:50] (03PS2) 10Ayounsi: Smokeping: remove asw- mgmt probing [puppet] - 10https://gerrit.wikimedia.org/r/419966 [03:29:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 857.49 seconds [04:05:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.13 seconds [04:46:33] (03PS8) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [04:47:08] (03CR) 10jerkins-bot: [V: 04-1] NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [04:48:52] (03PS9) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [04:54:01] (03CR) 10Madhuvishy: "Thanks for all the comments volans, I responded to all of them and fixed most :) Sorry I forgot to come back to this for this long!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [05:01:27] (03PS1) 10Madhuvishy: nfs: Remove config for deleted project wikidata-topicmaps [puppet] - 10https://gerrit.wikimedia.org/r/419970 [05:23:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [06:16:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419975 [06:16:33] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419975 [06:18:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419975 (owner: 10Marostegui) [06:20:05] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419975 (owner: 10Marostegui) [06:20:20] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419975 (owner: 10Marostegui) [06:21:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084 (duration: 00m 58s) [06:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:10] !log Stop MySQL on db2045 (s8 codfw master) for maintenance [06:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:45] !log Stop MySQL on db2051 (s4 codfw master) for maintenance [06:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:40] RECOVERY - Ubuntu mirror in sync with upstream on sodium is OK: /srv/mirrors/ubuntu is over 0 hours old. 
[06:52:39] !log Stop MySQL on db2048 (s1 codfw master) for maintenance [06:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:17] !log Stop MySQL on es2016 (es2 codfw master) for maintenance [07:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:34] * elukey looks for alter tables [07:06:51] Not today! They will be back on Monday! [07:15:34] !log Stop MySQL on es2017 (es3 codfw master) for maintenance [07:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:24] (03PS1) 10Elukey: profile::hadoop::monitoring: add explicit dependency to cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/419979 (https://phabricator.wikimedia.org/T188294) [07:38:52] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10483/" [puppet] - 10https://gerrit.wikimedia.org/r/419979 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [07:39:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [07:49:13] (03PS1) 10Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1077 [puppet] - 10https://gerrit.wikimedia.org/r/419980 (https://phabricator.wikimedia.org/T188294) [07:50:51] (03CR) 10Elukey: [C: 032] Assign role::analytics_cluster::hadoop::worker to analytics1077 [puppet] - 10https://gerrit.wikimedia.org/r/419980 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [07:53:09] !log reimage mc2036 after mainboard replacement (T185587) [07:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:16] T185587: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587 [07:56:32] 10Operations, 10DBA, 10Epic: Meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4055677 (10Marostegui) [07:57:29] 10Operations, 10DBA, 10Epic: Meta task for next DC failover issues - 
https://phabricator.wikimedia.org/T189107#4031557 (10Marostegui) [07:57:53] 10Operations, 10DBA, 10Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107#4031557 (10Marostegui) [08:03:10] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4055687 (10Tgr) Quick overview of the open technical tasks: * {T186965}: ** On sites using Tidy, styles will split paragraphs they are em... [08:14:55] (03PS1) 10Elukey: profile::hadoop: add explicit ordering between daemons and jmx agent [puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) [08:19:03] RECOVERY - IPsec on mc1036 is OK: Strongswan OK - 1 ESP OK [08:29:03] !log reboot druid1005 for kernel updates [08:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:18] (03PS2) 10Giuseppe Lavagetto: Convert netbox to use the docker-pkg build system [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419744 [08:36:28] (03PS1) 10Marostegui: db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419983 [08:38:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419983 (owner: 10Marostegui) [08:38:09] (03PS1) 10Marostegui: es1015.yaml: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/419984 [08:39:12] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419983 (owner: 10Marostegui) [08:39:27] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419983 (owner: 10Marostegui) [08:40:37] !log reboot druid1006 for kernel updates [08:40:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1015 for kernel, mariadb and socket upgrade (duration: 00m 58s) [08:40:42] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:15] !log Stop MySQL on es1015 for maintenance [08:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:00] (03CR) 10Marostegui: [C: 032] es1015.yaml: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/419984 (owner: 10Marostegui) [08:43:56] (03PS1) 10Jcrespo: dbproxy: Change m4-master from dbproxy1009 to dbproxy1004 [dns] - 10https://gerrit.wikimedia.org/r/419985 (https://phabricator.wikimedia.org/T183249) [08:44:59] (03CR) 10Marostegui: [C: 031] dbproxy: Change m4-master from dbproxy1009 to dbproxy1004 [dns] - 10https://gerrit.wikimedia.org/r/419985 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [08:47:42] (03CR) 10Ema: [C: 031] Drop use of experimental repository component for caches [puppet] - 10https://gerrit.wikimedia.org/r/415814 (https://phabricator.wikimedia.org/T188545) (owner: 10Muehlenhoff) [08:49:59] !log upgrade and restart of dbproxy1004 (passive) [08:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:06] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4055728 (10MoritzMuehlenhoff) 05Open>03Resolved mc2036 has been reimaged, closing. 
[08:50:47] (03PS4) 10Muehlenhoff: Drop use of experimental repository component for caches [puppet] - 10https://gerrit.wikimedia.org/r/415814 (https://phabricator.wikimedia.org/T188545) [08:51:37] (03CR) 10Muehlenhoff: [C: 032] Drop use of experimental repository component for caches [puppet] - 10https://gerrit.wikimedia.org/r/415814 (https://phabricator.wikimedia.org/T188545) (owner: 10Muehlenhoff) [08:53:34] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419987 [08:56:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419987 (owner: 10Marostegui) [08:57:27] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419987 (owner: 10Marostegui) [08:58:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool es1015 after kernel, mariadb and socket upgrade (duration: 00m 56s) [08:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:53] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419987 (owner: 10Marostegui) [08:59:26] (03CR) 10Elukey: [C: 031] "eventlog1002 is not in the analytics vlan so no changes are needed, and iiuc this will be transparent to eventlogging (namely after the DN" [dns] - 10https://gerrit.wikimedia.org/r/419985 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [09:01:31] (03CR) 10Filippo Giunchedi: [C: 032] lower TTL for puppetmaster-related CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/419802 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [09:01:57] (03CR) 10Jcrespo: "A common mistake we had in the past is public ip hosts being banned from access it. 
Can you confirm there is no 208.x host that will acces" [dns] - 10https://gerrit.wikimedia.org/r/419985 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [09:02:44] (03PS2) 10Filippo Giunchedi: install_server: use stretch for puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/419794 (https://phabricator.wikimedia.org/T184562) [09:02:56] (03PS1) 10Jcrespo: dbproxy: Remove old socket location and enable firewall [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) [09:03:50] (03CR) 10Filippo Giunchedi: [C: 032] install_server: use stretch for puppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/419794 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [09:04:05] (03PS2) 10Jcrespo: dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) [09:04:33] jynus: re:419985 - no idea about m5-master, did you mean m4? [09:05:03] yes, sorry [09:05:06] m4-master [09:05:26] (03CR) 10Jcrespo: "s/m5/m4/" [dns] - 10https://gerrit.wikimedia.org/r/419985 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [09:06:00] ah okok! So the only host that I know is using m4-master is eventlog1002, don't know anything else [09:06:47] should I wait to deploy so more people are around? 
[09:08:49] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Convert netbox to use the docker-pkg build system [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419744 (owner: 10Giuseppe Lavagetto) [09:09:26] jynus: from my pov we can proceed, I'll watch EL logs but I don't know of any other hosts needing a change for this [09:09:51] the analytics vlan has some firewall rules to allow traffic towards port 3306 of some db proxies [09:10:01] but it doesn't whitelist dbproxy1009 for example [09:10:15] (more context in https://phabricator.wikimedia.org/T189408) [09:10:22] so I guess nothing really uses it :) [09:10:23] then it doesn't affect us [09:10:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419990 (https://phabricator.wikimedia.org/T183469) [09:10:37] because this should be the same as db1009 [09:10:42] *dbproxy1009 [09:10:59] something does, as long as it is on the internal network [09:11:11] yep [09:11:44] netstat -tupn | grep 3306 shows no blocked host using it [09:11:54] but I wanted to double check it [09:12:03] thanks! 
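The `netstat -tupn | grep 3306` check at 09:11:44 can be approximated in code. The following is only a rough sketch, assuming the usual `netstat -tupn` column layout (proto, recv-q, send-q, local address, foreign address, state, pid/program); the function name `peers_on_port` and the sample addresses are hypothetical and do not come from the log.

```python
def peers_on_port(netstat_output, port=3306):
    """Return remote IPs holding an ESTABLISHED TCP connection to local `port`."""
    peers = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row: proto recv-q send-q local foreign state pid/program
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # Compare only the local port, i.e. clients connected to this host.
        if local.rsplit(":", 1)[-1] == str(port):
            peers.add(foreign.rsplit(":", 1)[0])
    return peers


# Hypothetical sample output; the addresses are made up for illustration.
SAMPLE = """\
tcp        0      0 10.0.0.9:3306     10.0.0.21:51234    ESTABLISHED 1234/haproxy
tcp        0      0 10.0.0.9:3306     10.0.0.22:40000    TIME_WAIT   -
tcp        0      0 10.0.0.9:22       10.0.0.30:50022    ESTABLISHED 999/sshd
"""

print(sorted(peers_on_port(SAMPLE)))
```

Unlike a plain `grep 3306`, this sketch ignores connections where 3306 is only the remote port, so it lists clients of the proxy rather than the proxy's own backend connections.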
[09:12:18] will deploy and monitor [09:12:36] it seems to me that eventlogging connects and disconnects many times [09:12:50] (03PS1) 10Marostegui: db1106.yaml: Disable notifications for db1106 [puppet] - 10https://gerrit.wikimedia.org/r/419991 (https://phabricator.wikimedia.org/T183469) [09:12:55] which normally is not that good, but it will work to our advantage here [09:12:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419990 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [09:14:13] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419990 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [09:14:28] (03PS2) 10Jcrespo: dbproxy: Change m4-master from dbproxy1009 to dbproxy1004 [dns] - 10https://gerrit.wikimedia.org/r/419985 (https://phabricator.wikimedia.org/T183249) [09:14:49] (03CR) 10Jcrespo: [C: 032] dbproxy: Change m4-master from dbproxy1009 to dbproxy1004 [dns] - 10https://gerrit.wikimedia.org/r/419985 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [09:15:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1106 - T183469 (duration: 00m 57s) [09:15:34] (03PS3) 10Marostegui: dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [09:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:39] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469 [09:15:45] ups, I rebased the wrong changeset :) [09:16:08] :-) [09:16:20] (03CR) 10Jcrespo: "Check for typos! 
:-)" [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [09:16:51] !log oblivian@tin Started deploy [netbox/deploy@f3e0159]: Re-deploying with the newly built artifacts [09:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:59] (03CR) 10Marostegui: [C: 031] dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [09:17:13] (03CR) 10Marostegui: [C: 032] db1106.yaml: Disable notifications for db1106 [puppet] - 10https://gerrit.wikimedia.org/r/419991 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [09:17:37] !log oblivian@tin Finished deploy [netbox/deploy@f3e0159]: Re-deploying with the newly built artifacts (duration: 00m 47s) [09:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:44] !log oblivian@tin (no justification provided) [09:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:47] elukey: actually, I see no activity change, so it may require a reload on the script- but that can wait [09:18:50] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419990 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [09:19:35] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4055760 (10fgiunchedi) On Monday 19th I'll reinstall puppetmaster2001 with stretch, using the following procedure: # Depool puppetmaster2001 via... 
[09:19:41] (03CR) 10Jcrespo: [C: 04-2] "Not until traffic has migrated to dbproxy1004" [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [09:21:38] jynus: ah so you mean that eventlogging might be caching the m4-master cname and not flipping to dbproxy1004? [09:22:34] not as much caching as "it doesn't reconnect" [09:22:41] but same effect [09:22:55] we can wait a bit more [09:23:27] as the proxies point to the same master, so no issues with that [09:24:22] (03PS1) 10Giuseppe Lavagetto: Correct frozen-requirements.txt name in the scap promote stage [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419992 [09:24:56] (03PS4) 10Jcrespo: dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) [09:25:08] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Correct frozen-requirements.txt name in the scap promote stage [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/419992 (owner: 10Giuseppe Lavagetto) [09:26:42] !log oblivian@tin Started deploy [netbox/deploy@ccc342a]: Re-deploying with the newly built artifacts/2 [09:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:11] !log oblivian@tin Finished deploy [netbox/deploy@ccc342a]: Re-deploying with the newly built artifacts/2 (duration: 00m 29s) [09:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:31] (03PS1) 10Marostegui: db-eqiad.php: Restore original weight for es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419993 [09:29:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight for es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419993 (owner: 10Marostegui) [09:30:27] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight for es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419993 (owner: 
10Marostegui) [09:30:40] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight for es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419993 (owner: 10Marostegui) [09:31:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for es1015 after kernel, mariadb and socket upgrade (duration: 00m 57s) [09:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:08] jynus: tcpdumping on eventlog1002 shows dbproxy1009 still used, shall I restart the mysql daemon to force the DNS resolution? [09:45:40] the mysql daemon? [09:45:49] on eventlog? [09:46:16] if you can do it without loss, yes [09:46:36] (03CR) 10Gehel: "puppet compiler is happy running on a few directly impacted nodes: https://puppet-compiler.wmflabs.org/compiler02/10485/" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [09:49:11] not sure if we may need to restart eventlogging-replication too [09:50:11] jynus: sorry bad naming, there is a kafka consumer on eventlog that pushes data to m4, I can restart only that one [09:50:27] checking also the replication script [09:51:01] it depends on each application: many don't reconnect often or only on error [09:51:20] !log restart eventlogging-consumer@mysql-m4 on eventlog1002 to force the DNS resolution of m4-master (changed from dbproxy1009 -> dbproxy1004) [09:51:24] that is why dns is normally not ok for failover, but we are doing a controlled switchover here [09:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:40] I think it switched now [09:53:23] I only see the connection from db1107, probably the replication [09:54:01] it is weird because it is not logging inserts anymore, but the logs are clean [09:54:35] there you go [09:54:37] started again [09:55:58] I definitely see a connection from eventlog1002 to dbproxy1004 [09:56:19] also restarted the other mysql producer that pushes to m4 (for eventbus data) [09:56:35] !log Zuul 
coverage pipeline is deadlocked on an unreleased mutex. Will need a new Zuul version. [09:56:37] no more connections to port 3306 [09:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:43] on proxy1009 [09:57:14] so we are good connection wise, I will wait to check nothing is broken before restarting it [09:57:32] !log restart eventlogging-consumer@mysql-eventbus on eventlog1002 to force the DNS resolution of m4-master (changed from dbproxy1009 -> dbproxy1004) [09:57:33] PROBLEM - Nginx local proxy to apache on mwdebug2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1313 bytes in 0.151 second response time [09:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:34] RECOVERY - Nginx local proxy to apache on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 0.257 second response time [09:58:53] I don't see any log related to mwdebug2001 [09:59:10] maybe it crashed [09:59:13] sorry, that's me, was about to log [09:59:20] oh, sorry [09:59:41] if I don't see maintenance, I have to check [09:59:44] (03PS1) 10Urbanecm: Add more import sources to mawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419996 (https://phabricator.wikimedia.org/T188486) [09:59:46] jynus: ah yes on db1108 I can see ExecStart=/usr/local/bin/eventlogging_sync.sh -D 90 -b 1000 -d log m4-master.eqiad.wmnet localhost [09:59:59] I'll restart that as well [10:00:00] it's depooled, I'm reverting the setup that was used for the dry-run tests for ICU 57 [10:00:04] elukey: wait [10:00:28] well, restart will be safe [10:00:41] so better safe than sorry [10:00:41] yeah [10:00:45] !log reverting the HHVM/ICU 57 setup on mwdebug2001 which was used for the dry run tests [10:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:34] !log restart eventlogging_sync on db1108 (eventlogging db slave) as precautions after the change of 
m4-master.eqiad.wmnet's CNAME [10:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:43] logs are good [10:02:03] we should document all this [10:02:27] sure, any preference where? [10:02:37] I will give it a shot on the mariadb/misc wikitech page [10:02:44] but it can be on the eventlogging ones [10:03:00] as long as it is linked from the other place [10:03:42] the part I don't know 100% is which of those restarts are safe/impacting, etc. [10:04:12] (03PS1) 10Muehlenhoff: Revert "Temporarily remove mwdebug2001 from debug proxy aliases" [puppet] - 10https://gerrit.wikimedia.org/r/419997 [10:05:09] we could also rethink the failover model: if failing over automatically creates issues, maybe it is better to do it manually and let errors happen instead [10:06:00] e.g. let the proxy page if the host is detected as down and prevent writes [10:08:48] yes, this is something I wanted to talk about with you; we'd need to rethink that failover model since, as we saw last time, it doesn't fit eventlogging very well [10:09:55] possibly also forcing eventlogging (the daemon that does the mysql inserts) to stop if it can't write to db1107 for any reason [10:11:16] we can create a task for that, as I don't know most of the story [10:11:17] (03PS2) 10Urbanecm: Add more import sources to mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419996 (https://phabricator.wikimedia.org/T188486) [10:11:27] jynus: ack will do thanks! [10:12:05] elukey: I see no more connections to dbproxy1009, will upgrade it now, thanks for the help [10:12:43] (03CR) 10Jcrespo: dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [10:13:10] jynus: ack! 
I am here if needed [10:13:26] (03PS2) 10Muehlenhoff: Revert "Temporarily remove mwdebug2001 from debug proxy aliases" [puppet] - 10https://gerrit.wikimedia.org/r/419997 [10:13:39] (03PS5) 10Jcrespo: dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) [10:14:57] (03CR) 10Muehlenhoff: [C: 032] Revert "Temporarily remove mwdebug2001 from debug proxy aliases" [puppet] - 10https://gerrit.wikimedia.org/r/419997 (owner: 10Muehlenhoff) [10:17:42] (03CR) 10Filippo Giunchedi: [C: 031] Add puppetboard.wikimedia.org entry [dns] - 10https://gerrit.wikimedia.org/r/419800 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [10:18:02] (03CR) 10Filippo Giunchedi: [C: 031] Puppetboard: add varnish director entries [puppet] - 10https://gerrit.wikimedia.org/r/419763 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [10:19:52] !log upgrade and restart of dbproxy1009 (passive) [10:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:00] (03CR) 10Filippo Giunchedi: "LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/419762 (https://phabricator.wikimedia.org/T184563) (owner: 10Volans) [10:22:23] (03CR) 10Filippo Giunchedi: [C: 031] "Rather opaque so sort-LGTM" [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419721 (owner: 10Volans) [10:24:14] (03CR) 10Jcrespo: [C: 032] dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [10:24:21] (03PS6) 10Jcrespo: dbproxy: Remove old socket location & enable firewall @ dbproxy1009 [puppet] - 10https://gerrit.wikimedia.org/r/419989 (https://phabricator.wikimedia.org/T148507) [10:25:39] (03CR) 10Filippo Giunchedi: Initial import (032 comments) [software/puppetboard/deploy] - 10https://gerrit.wikimedia.org/r/419720 (owner: 10Volans) 
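The m4-master switchover discussed between 09:21 and 10:12 only took effect once the eventlogging consumers were restarted, because a long-lived connection keeps using the address it resolved when it was opened. A minimal Python sketch of that behaviour follows; `PinnedClient` and the in-memory `dns` table are hypothetical stand-ins for a real client and real DNS, and only the host names are taken from the log.

```python
class PinnedClient:
    """Toy client that resolves its target name only when it (re)connects."""

    def __init__(self, name, resolver):
        self.name = name
        self.resolver = resolver  # callable mapping a name to an address
        self.address = None

    def connect(self):
        # Name resolution happens here and nowhere else, so an already
        # open connection never notices a later change of the record.
        self.address = self.resolver(self.name)
        return self.address


# Fake DNS table standing in for the real CNAME.
dns = {"m4-master": "dbproxy1009"}
client = PinnedClient("m4-master", dns.__getitem__)
client.connect()

dns["m4-master"] = "dbproxy1004"  # the CNAME change
stale = client.address            # still the old proxy
fresh = client.connect()          # a restart or reconnect re-resolves

print(stale, "->", fresh)
```

This is why, as noted in the channel, a DNS change works for a controlled switchover with manual restarts but is a poor fit for automatic failover.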
[10:32:23] (03CR) 10Giuseppe Lavagetto: [C: 031] cache: depool puppetmaster2001 from config-master.w.o [puppet] - 10https://gerrit.wikimedia.org/r/419795 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [10:32:47] (03CR) 10Giuseppe Lavagetto: [C: 031] Depool codfw puppetmaster [dns] - 10https://gerrit.wikimedia.org/r/419774 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [10:35:50] (03PS1) 10Jcrespo: Switchover temporarily wikireplica-web to dbproxy1010 [dns] - 10https://gerrit.wikimedia.org/r/419999 (https://phabricator.wikimedia.org/T183249) [10:38:32] (03CR) 10Marostegui: [C: 031] Switchover temporarily wikireplica-web to dbproxy1010 [dns] - 10https://gerrit.wikimedia.org/r/419999 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [10:40:10] (03CR) 10Jcrespo: [C: 032] Switchover temporarily wikireplica-web to dbproxy1010 [dns] - 10https://gerrit.wikimedia.org/r/419999 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [10:41:00] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#3888192 (10Joe) The plan looks fine to me! 
[10:45:37] !log disable puppet and load balance between 3 wikirreplicas on dbproxy1010 [10:45:40] (03PS2) 10KartikMistry: lttoolbox: Update to latest upstream release [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/419346 (https://phabricator.wikimedia.org/T189075) [10:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:17] (03PS1) 10Filippo Giunchedi: Use codfw puppetmasters in ulsfo [dns] - 10https://gerrit.wikimedia.org/r/420003 [10:53:19] (03PS1) 10Filippo Giunchedi: Use codfw puppetmasters in eqsin [dns] - 10https://gerrit.wikimedia.org/r/420004 [10:53:21] (03PS1) 10Filippo Giunchedi: Use codfw puppetmasters in codfw [dns] - 10https://gerrit.wikimedia.org/r/420005 [10:53:50] !log Upgrading zuul to zuul_2.5.1-wmf4 to resolve a mutex deadlock T189859 [10:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:56] T189859: Zuul coverage pipeline is no more processing mwext-phpunit-coverage-patch jobs - https://phabricator.wikimedia.org/T189859 [10:56:52] (03PS1) 10Muehlenhoff: Remove support for experimental section [puppet] - 10https://gerrit.wikimedia.org/r/420006 [11:01:51] (03CR) 10Filippo Giunchedi: [C: 031] Remove support for experimental section [puppet] - 10https://gerrit.wikimedia.org/r/420006 (owner: 10Muehlenhoff) [11:11:10] !log zuul: reenqueue all coverage jobs lost when restarting Zuul [11:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:23] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#3955291 (10hashar) [11:13:33] 10Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 10Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#3955291 (10hashar) [11:14:16] 10Operations, 
10Continuous-Integration-Infrastructure, 10Packaging, 10Zuul: Upload new zuul and jenkins-debian-glue packages to apt.wikimedia.org - https://phabricator.wikimedia.org/T186786#4056054 (10hashar) I have bumped Zuul to zuul_2.5.1-wmf4 for T189859 and thus updated this task description. [11:25:55] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [11:26:15] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [11:35:35] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-apache-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:36:15] (03CR) 10Filippo Giunchedi: "LGTM, we'll need to adjust https://gerrit.wikimedia.org/r/c/400241/ once this is merged" [puppet] - 10https://gerrit.wikimedia.org/r/415328 (owner: 10Muehlenhoff) [11:43:26] 10Operations, 10ops-eqiad: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#4056098 (10fgiunchedi) [11:44:04] 10Operations, 10ops-esams: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#4056099 (10fgiunchedi) [11:51:47] (03PS1) 10Muehlenhoff: Repurpose four image scalers as video scalers [puppet] - 10https://gerrit.wikimedia.org/r/420011 [11:54:47] 10Operations, 10ops-eqiad: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#4056115 (10fgiunchedi) cc #hardware-requests as per process [11:54:58] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#4056116 (10fgiunchedi) [11:55:16] 10Operations, 10ops-esams, 10hardware-requests: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#4056118 (10fgiunchedi) [12:00:06] (03PS2) 10Giuseppe Lavagetto: etcd: add class for v3 basic installation [puppet] - 10https://gerrit.wikimedia.org/r/419358 (https://phabricator.wikimedia.org/T166081) [12:00:07] (03PS1) 10Giuseppe Lavagetto: 
profile::etcd::tlsproxy: allow more configuration options [puppet] - 10https://gerrit.wikimedia.org/r/420012 (https://phabricator.wikimedia.org/T166081) [12:00:10] (03PS1) 10Giuseppe Lavagetto: etcd::v3: add basic monitoring [puppet] - 10https://gerrit.wikimedia.org/r/420013 [12:00:12] (03PS1) 10Giuseppe Lavagetto: role: add configcluster_stretch [puppet] - 10https://gerrit.wikimedia.org/r/420014 (https://phabricator.wikimedia.org/T166081) [12:00:38] !log Run pt-table-checksum on m5 [12:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:30] (03CR) 10jerkins-bot: [V: 04-1] role: add configcluster_stretch [puppet] - 10https://gerrit.wikimedia.org/r/420014 (https://phabricator.wikimedia.org/T166081) (owner: 10Giuseppe Lavagetto) [12:02:56] !log Run pt-table-checksum on m2 [12:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:46] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10488/ this is effectively a noop." [puppet] - 10https://gerrit.wikimedia.org/r/420012 (https://phabricator.wikimedia.org/T166081) (owner: 10Giuseppe Lavagetto) [12:04:30] (03CR) 10Filippo Giunchedi: "Nice! 
I tried running PCC but failed though https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/10487/console" [puppet] - 10https://gerrit.wikimedia.org/r/400241 (owner: 10Dzahn) [12:06:15] PROBLEM - DPKG on dbproxy1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:06:37] ^ that is me doing an upgrade [12:07:15] RECOVERY - DPKG on dbproxy1005 is OK: All packages OK [12:10:21] down on icinga, you will need it for the restart anyway [12:10:52] remember to log the upgrade if the restart if you can- it helps me not worry :-) [12:11:17] s/if/and [12:12:21] yeah :) [12:12:32] !log Reboot dbproxy1005 for kernel upgrade [12:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] (03PS1) 10Rduran: Add flake8 config and requirement [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420015 [12:41:51] (03CR) 10Jcrespo: "Comments of what I was thinking just by reading the source code. Probably the original perl wasn't any better, but we don't need to repeat" (034 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [12:46:35] (03PS6) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) [12:47:59] (03CR) 10Imarlier: navtiming.py: Make sure to record country specific when oversampling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419738 (https://phabricator.wikimedia.org/T189780) (owner: 10Imarlier) [12:48:24] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4056221 (10chasemp) >>! In T183937#4053183, @chasemp wrote: >>>! In T183937#4051948, @RobH wrote: >> Ok, escalating this to @chasemp for completion. The systems are installed and calling into pup... 
[12:48:37] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4056223 (10chasemp) https://phabricator.wikimedia.org/T183937#4056221 [12:48:49] (03PS6) 10Giuseppe Lavagetto: hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/415573 [12:51:09] !log text-esams: reboot for kernel upgrades T188092 and to mitigate https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=7&fullscreen&orgId=1&from=1518746284946&to=1521204628041 [12:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:40] !log uploaded libsodium23/php-acpu/php-mailparse to thirdparty/php72 (deps/extentions needed by Phabricator) [12:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:36] (03CR) 10Jcrespo: "Testing on my development machine:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [12:59:19] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:00:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:01:45] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#4056260 (10chasemp) 05Open>03Resolved closed in favor of T189871 [13:03:46] that's a spike of 504s in eqiad ^ [13:05:01] (03PS1) 10Rush: openstack: neutron l3-agent sysctl settings [puppet] - 
10https://gerrit.wikimedia.org/r/420018 (https://phabricator.wikimedia.org/T188266) [13:06:01] (03CR) 10Rush: [C: 032] openstack: neutron l3-agent sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/420018 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:06:10] (03PS2) 10Rush: openstack: neutron l3-agent sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/420018 (https://phabricator.wikimedia.org/T188266) [13:06:59] (03PS1) 10Arturo Borrero Gonzalez: site.pp: put labmon1002 into work [puppet] - 10https://gerrit.wikimedia.org/r/420019 (https://phabricator.wikimedia.org/T189871) [13:07:21] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:08:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:08:48] (03PS4) 10Rush: openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) [13:09:20] (03PS7) 10Giuseppe Lavagetto: hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/415573 [13:09:31] (03CR) 10jerkins-bot: [V: 04-1] openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [13:10:48] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler03/10490/ only change is on tin, and it's legit given we don't want run hhvm on the webserver " [puppet] - 10https://gerrit.wikimedia.org/r/415573 (owner: 10Giuseppe Lavagetto) [13:11:06] (03CR) 
10Rush: "http://puppet-compiler.wmflabs.org/10489/" [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [13:11:36] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/415573 (owner: 10Giuseppe Lavagetto) [13:11:46] (03PS8) 10Giuseppe Lavagetto: hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/415573 [13:13:02] (03PS1) 10KartikMistry: apertium-spa-ita: Fix dependency [debs/contenttranslation/apertium-spa-ita] - 10https://gerrit.wikimedia.org/r/420020 [13:15:24] !log disable puppet across cloud things for safe rollout [13:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:18] (03PS2) 10Muehlenhoff: Remove support for experimental section [puppet] - 10https://gerrit.wikimedia.org/r/420006 [13:21:39] (03CR) 10Rush: [V: 032 C: 032] "override to not couple site.pp changes with role application on debian" [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [13:21:57] (03PS5) 10Rush: openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) [13:22:20] (03CR) 10Rush: [V: 032 C: 032] openstack: trial of mixed mitaka/liberty nova compute [puppet] - 10https://gerrit.wikimedia.org/r/419737 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [13:23:06] (03PS1) 10Jcrespo: cloud-dns: Point wikireplica-web to dbproxy10010 [puppet] - 10https://gerrit.wikimedia.org/r/420021 (https://phabricator.wikimedia.org/T183249) [13:23:08] (03PS5) 10Giuseppe Lavagetto: hhvm::admin: convert to using httpd instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/415574 [13:25:26] (03PS2) 10Jcrespo: cloud-dns: Point wikireplica-web to dbproxy10010 [puppet] - 10https://gerrit.wikimedia.org/r/420021 
(https://phabricator.wikimedia.org/T183249) [13:25:28] (03PS1) 10Jcrespo: dbproxy1011: Adapt syntax to strech and fix socket location [puppet] - 10https://gerrit.wikimedia.org/r/420022 (https://phabricator.wikimedia.org/T183249) [13:25:56] chasemp, arturo would any of you have time to help me with ^ [13:26:34] jynus: I'm right in the middle of a weird merge and checking for consequences, arturo can you lend jynus a hand? [13:27:43] sure [13:28:25] jynus: I'm reading SAL [13:28:29] (03PS3) 10Muehlenhoff: Remove support for experimental section [puppet] - 10https://gerrit.wikimedia.org/r/420006 [13:29:22] (03CR) 10Muehlenhoff: [C: 032] Remove support for experimental section [puppet] - 10https://gerrit.wikimedia.org/r/420006 (owner: 10Muehlenhoff) [13:29:34] I think things there is anything relevant on the log [13:29:38] (03PS3) 10Andrew Bogott: nova.conf: adjust db pool settings for all services [puppet] - 10https://gerrit.wikimedia.org/r/415619 (https://phabricator.wikimedia.org/T188589) [13:29:43] *don't think [13:29:58] oh, I though you were pointing me to SAL [13:30:12] no, I pointed you to the gerrit reviews [13:30:49] jynus: I have wikibugs disabled, in case you are referring to it. Link? 
[13:31:12] https://gerrit.wikimedia.org/r/420021 [13:31:14] (03PS3) 10Jdrewniak: Replace portals submodule with portals/deploy submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393239 (https://phabricator.wikimedia.org/T180777) [13:31:18] https://gerrit.wikimedia.org/r/420022 [13:32:05] (03CR) 10Andrew Bogott: [C: 032] nova.conf: adjust db pool settings for all services [puppet] - 10https://gerrit.wikimedia.org/r/415619 (https://phabricator.wikimedia.org/T188589) (owner: 10Andrew Bogott) [13:32:07] (03PS1) 10Rduran: Make sure the connection has been open [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420024 [13:33:13] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420025 [13:33:18] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420025 [13:35:47] ok jynus, so what would you need? [13:35:52] (03PS7) 10Rduran: Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [13:35:54] (03PS2) 10Rduran: Add flake8 config and requirement [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420015 [13:36:21] the commits seems simple, if the data is right (IP addresses, ports, etc) [13:36:31] do you need me to confirm that data? [13:36:39] so do you need to run some script after that ? [13:37:30] * arturo searching [13:37:36] (03Abandoned) 10Rduran: Make sure the connection has been open [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420024 (owner: 10Rduran) [13:37:37] what would those scripts do? 
[13:37:50] those scripts are "yours" [13:37:54] I cannot know [13:38:14] that is why I am asking for help [13:38:43] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420025 (owner: 10Marostegui) [13:38:58] (03PS1) 10ArielGlenn: fix up svg retrieval [wikitech-static] - 10https://gerrit.wikimedia.org/r/420027 [13:39:57] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420025 (owner: 10Marostegui) [13:40:06] (03CR) 10ArielGlenn: "Haven't tested it on non-svg images yet, I just checked that response.content works for svg files." [wikitech-static] - 10https://gerrit.wikimedia.org/r/420027 (owner: 10ArielGlenn) [13:40:11] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2050" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420025 (owner: 10Marostegui) [13:41:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2050 (duration: 00m 58s) [13:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:05] arturo: it is ok, if you don't know either, just tell me :-) [13:46:12] I don't know either :-) I would say we only need to reconnect the databases to the new host/port of the proxy server [13:46:31] reconnect the databases? [13:46:35] but exploring now, I see that there is the wikireplica_dns which is probably what you are referring to [13:46:58] I don't know who made those, some of your colleagues [13:48:04] (03CR) 10Ottomata: "Huh, I woulda thought require would be enough too!" [puppet] - 10https://gerrit.wikimedia.org/r/419979 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [13:48:37] (03CR) 10Elukey: [C: 032] "me too! 
But for some reason it doesn't in our settings (or maybe I am missing something) :(" [puppet] - 10https://gerrit.wikimedia.org/r/419979 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [13:48:49] (03CR) 10Ottomata: [C: 031] profile::hadoop: add explicit ordering between daemons and jmx agent [puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [13:49:01] ok jynus so since we are changing the IP addresses, I'm confident that we should update the DNS records, and that's what the wikireplica_dns does [13:49:13] (03CR) 10Ottomata: [C: 031] profile::hadoop: add explicit ordering between daemons and jmx agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [13:49:13] so, let's go merge and I will run the script [13:49:16] so does that have to run? [13:49:20] where? [13:49:24] in labcontrol1001 [13:49:33] ok, at leasy you know more than I do! [13:49:40] that is for sure :-) [13:49:42] (03CR) 10Elukey: profile::hadoop: add explicit ordering between daemons and jmx agent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [13:50:06] ok to merge and test, and we revert if something goes wrong? [13:50:16] in the meantime between your merge and my script run, there is a possibility that new connections to web.db.svc.eqiad.wmflabs wont work [13:50:43] (because the old IP will be used until the DNS records are updated) [13:50:47] ok, let me merge and you deploy and run the script? 
[13:50:53] the old ip works still [13:50:58] I made sure of that [13:51:07] ok, great, then let's go :-) [13:51:31] one thing I would suggest that would make things easier [13:51:44] is to put there not A records but CNAME records [13:51:55] so I can handle the production dns easier [13:52:43] I think now there is a split brain between production and openstack [13:52:44] (03CR) 10Andrew Bogott: [C: 032] "This broke instance creation, I don't know why" [puppet] - 10https://gerrit.wikimedia.org/r/415619 (https://phabricator.wikimedia.org/T188589) (owner: 10Andrew Bogott) [13:52:57] (03PS1) 10Andrew Bogott: Revert "nova.conf: adjust db pool settings for all services" [puppet] - 10https://gerrit.wikimedia.org/r/420030 [13:53:11] (03PS3) 10Jcrespo: cloud-dns: Point wikireplica-web to dbproxy10010 [puppet] - 10https://gerrit.wikimedia.org/r/420021 (https://phabricator.wikimedia.org/T183249) [13:53:27] (03CR) 10Andrew Bogott: [V: 032 C: 032] Revert "nova.conf: adjust db pool settings for all services" [puppet] - 10https://gerrit.wikimedia.org/r/420030 (owner: 10Andrew Bogott) [13:53:57] yeah, we could totally discuss that. 
Perhaps it has been discussed already and I ignore it [13:54:01] (03PS4) 10Jcrespo: cloud-dns: Point wikireplica-web to dbproxy1010 [puppet] - 10https://gerrit.wikimedia.org/r/420021 (https://phabricator.wikimedia.org/T183249) [13:54:05] jynus: ping me when the merge is done [13:54:30] yes, it is rebasing + fixing typo [13:54:31] sorry [13:54:51] (03CR) 10Jcrespo: [C: 032] cloud-dns: Point wikireplica-web to dbproxy1010 [puppet] - 10https://gerrit.wikimedia.org/r/420021 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [13:55:30] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/4e1ee86f-2921-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [13:55:44] arturo: I merged on puppetmaster [13:56:05] maybe you can run puppet and see if it gets executed automatically or has to be done manually? [13:56:12] ack, updating puppet [13:57:32] !log stopping nodepool temporarily during changes to nova.conf [13:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:44] tell me what you see :-) [13:57:58] jynus: puppet see the change [13:58:05] also maybe you can check the dns change taking effect ? [13:58:11] the script is doing something weird :-) [13:58:17] weird? [13:58:18] https://www.irccloud.com/pastebin/KIpcYwF6/ [13:58:29] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [13:58:40] uhg [13:59:00] that doesn't seem right to me [13:59:10] I would revert and ask [13:59:21] wait [13:59:24] ok [13:59:39] I don't think the script is failing because what we merged [13:59:42] jynus: nodepool is on m5-master [14:00:32] maybe it will come back? [14:00:45] hashar: what do you see broken? 
[14:00:51] ManagerStoppedException: Manager wmflabs-eqiad is no longer running [14:00:56] it cant reach nova I guess [14:01:10] arturo: I say we revert [14:01:21] or hold and let me restart nodepool [14:01:23] althoug that is probably another issue [14:01:31] the one andrew was working on [14:01:38] andrewbogott: ^^^ [14:01:41] see log^ [14:01:47] he said he stopped it [14:01:49] ahh [14:01:58] hence the alarm :] [14:01:59] yep, it's me, unrelated [14:02:24] arturo: what do you see, is something happening? [14:02:41] are dns entries updated? did they break? [14:03:04] | fault | {u'message': u'Timed out waiting for a reply to message ID 1dc4b174c4034162b7c5155b4bdd0f41', u'code': 500, u'created': u'2018-03-16T13:55:24Z'} | [14:03:08] a minute [14:03:15] that is for a random instance. But that is from 8 minutes ago [14:03:49] jynus: andrew merged the change to do better db pool management for nova and it stopped the world, and a revert didn't immediately fix it afaiu [14:04:37] ACKNOWLEDGEMENT - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d amusso temp stop for openstack maintenance [14:05:07] chasemp: I am not worried about that [14:05:25] I can wait [14:05:46] can I help? [14:07:14] (03PS1) 10Elukey: Allow the config of maximum tolerated failed volumes for the datanode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420031 [14:07:27] jynus: is now working [14:07:33] (03CR) 10jerkins-bot: [V: 04-1] Allow the config of maximum tolerated failed volumes for the datanode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420031 (owner: 10Elukey) [14:07:42] jynus: it worked, but took soo long [14:07:44] https://www.irccloud.com/pastebin/4m1TtFTt/ [14:07:46] oh, so probably the interaction of the 2 works? 
[14:08:10] I will create a ticket to put CNAMEs there instead [14:08:16] jynus: ok thanks [14:08:18] (at least to propose it) [14:08:30] as a warning- we will have to do this 2 more times [14:08:32] jynus: any hint on how to check whether tools now see the new DB names? [14:08:44] I know the dns, I don't know where to query [14:09:17] this is probably inside our sql custom command [14:09:26] s1.analytics.db.svc.eqiad.wmflabs [14:09:31] sorry [14:09:36] the other way around [14:09:51] s1.web.db.svc.eqiad.wmflabs should point to 14 not 15 now [14:12:04] aborrero@tools-bastion-05:~$ host s1.web.db.svc.eqiad.wmflabs [14:12:04] s1.web.db.svc.eqiad.wmflabs has address 10.64.37.14 [14:12:09] jynus: ^^^ [14:12:12] ok, that means it worked [14:12:30] ok, what an adventure :-) [14:12:33] can I ask what you did, so I don't have to ask you the other 2 times? [14:12:49] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:12:57] jynus: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Step_7:_setting_up_DNS [14:13:41] the 2 commands, as is? [14:13:56] yeah [14:14:09] or do you prefer me to loop you in, in case I break something? 
[14:14:13] not necessarily you [14:14:16] someone on cloud [14:14:22] jynus: sure, why not [14:14:29] !log restarting rabbitmq on labcontrol1001 [14:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:50] jynus: I will vanish soon for lunch, in case you think on following up right now [14:14:56] (03CR) 10Muehlenhoff: [C: 031] "I remember that we rolled back the original change (293743) and I checked why to ensure we don't run into the same issue: In the original " [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [14:15:12] oh, no, I need a few hours to reimage the proxy [14:15:18] now that it is not longer in use [14:16:04] (03CR) 10Jcrespo: [C: 031] "+1 based on mortiz research" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [14:16:27] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (10aborrero) I just update wikireplica DNS records: ``` root@labcontrol1001:~# /usr/local/sbin/wikireplica_dns --aliases -v --zone web.db.svc.eqiad.wmflabs. 2018-03-16T14... [14:17:02] (03PS2) 10Elukey: Allow the config of maximum tolerated failed volumes for the datanode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420031 [14:17:46] (03CR) 10Jcrespo: [C: 031] "I assinged it to gehel only to mark that it is waiting on him to give the ok. 
I can marge this afterwards, but it would be nice if some of" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [14:19:39] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp3033_v6 [14:19:39] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp3033_v6 [14:19:39] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp3033_v6 [14:20:27] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10494/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420031 (owner: 10Elukey) [14:20:41] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 56 ESP OK [14:20:41] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 56 ESP OK [14:20:41] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 56 ESP OK [14:20:45] (03CR) 10Gehel: [C: 031] "This looks good to me, but @jynus pointed out that there were issues with the previous similar attempt. Moritz also had a look and it look" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [14:22:02] (03CR) 10Jcrespo: [C: 031] "I will try maybe Monday, with no promises." [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [14:22:29] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:22:59] (03PS2) 10Elukey: profile::hadoop: add explicit ordering between daemons and jmx agent [puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) [14:25:03] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056501 (10chasemp) 05Resolved>03Open @andrew tried to merge the change to allow nova to be more gracious and it didn't work out. https://ph... 
[14:25:21] !log reboot druid1002 for kernel updates [14:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:11] !log restarting nodepool on nodepool1001 [14:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:29] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [14:32:49] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:35:42] (03PS2) 10KartikMistry: apertium-spa-ita: Fix dependency [debs/contenttranslation/apertium-spa-ita] - 10https://gerrit.wikimedia.org/r/420020 [14:37:20] 10Operations, 10Discovery-Search: Additional network ports for elasticsearch servers? - https://phabricator.wikimedia.org/T189854#4056525 (10EBernhardson) I'm also not sure this is worth the effort (not sure how much effort it would be). It's not necessary for standard prod usage, it would only help cluster ma... 
[14:38:27] (03PS1) 10Ottomata: Revert back to Kafka analytics cluster for eventlogging eventbus mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/420036 (https://phabricator.wikimedia.org/T183297) [14:39:24] (03PS2) 10Ottomata: Revert back to Kafka analytics for eventlogging eventbus mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/420036 (https://phabricator.wikimedia.org/T183297) [14:40:14] (03PS3) 10Ottomata: Revert back to Kafka analytics for eventlogging eventbus mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/420036 (https://phabricator.wikimedia.org/T183297) [14:42:29] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:42:31] (03CR) 10Ottomata: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10495/eventlog1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/420036 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [14:48:25] !log reimage dbproxy1011 [14:48:28] !log reset contintcloud quotas as per https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#incorrect_quota_violations [14:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:51] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4056565 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1011.eqiad.wmnet'] ``` The log can be found in `/var/... 
[14:50:58] (03PS2) 10Jcrespo: dbproxy1011: Adapt syntax to strech and fix socket location [puppet] - 10https://gerrit.wikimedia.org/r/420022 (https://phabricator.wikimedia.org/T183249) [14:58:35] (03PS3) 10Elukey: profile::hadoop: add explicit ordering between daemons and jmx agent [puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) [14:58:37] (03CR) 10Jcrespo: [C: 032] dbproxy1011: Adapt syntax to strech and fix socket location [puppet] - 10https://gerrit.wikimedia.org/r/420022 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [14:59:13] PROBLEM - configured eth on labvirt1021 is CRITICAL: eth1 reporting no carrier. [14:59:32] PROBLEM - DPKG on labvirt1021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:59:39] (03PS4) 10Elukey: profile::hadoop: add explicit ordering between daemons and jmx agent [puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) [14:59:41] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10492/terbium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/415574 (owner: 10Giuseppe Lavagetto) [15:00:06] (03PS6) 10Giuseppe Lavagetto: hhvm::admin: convert to using httpd instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/415574 [15:00:32] RECOVERY - DPKG on labvirt1021 is OK: All packages OK [15:00:42] (03PS1) 10Rush: wip: openstack: neutron component comments [puppet] - 10https://gerrit.wikimedia.org/r/420043 [15:01:02] PROBLEM - puppet last run on labvirt1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 44 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[qemu-system] [15:04:36] (03PS4) 10Muehlenhoff: Switch debdeploy clients to Python 3 [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/413397 [15:06:22] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10497/ looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/419982 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [15:08:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:08:19] (03PS7) 10Giuseppe Lavagetto: hhvm::admin: convert to using httpd instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/415574 [15:08:54] (03CR) 10Ottomata: [C: 031] Allow the config of maximum tolerated failed volumes for the datanode [puppet/cdh] - 10https://gerrit.wikimedia.org/r/420031 (owner: 10Elukey) [15:10:36] (03PS1) 10Ottomata: Move burrow eventbus mysql lag monitor back to Kafka analytics [puppet] - 10https://gerrit.wikimedia.org/r/420044 [15:11:02] RECOVERY - puppet last run on labvirt1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:46] (03PS2) 10Ottomata: Move burrow eventbus mysql lag monitor back to Kafka analytics [puppet] - 10https://gerrit.wikimedia.org/r/420044 [15:11:49] (03CR) 10Ottomata: [V: 032 C: 032] Move burrow eventbus mysql lag monitor back to Kafka analytics [puppet] - 10https://gerrit.wikimedia.org/r/420044 (owner: 10Ottomata) [15:12:58] 10Operations, 10monitoring: Monitor resource usage on a per-cgroup basis - https://phabricator.wikimedia.org/T183146#4056674 (10fgiunchedi) [15:13:00] 10Operations, 10monitoring, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027#4056676 (10fgiunchedi) [15:13:03] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 8 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:13:32] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:14:43] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056680 (10Andrew) 05Open>03Resolved a:03Andrew >My suggestion is to close this without touching nova Works for me! [15:15:28] (03PS4) 10Giuseppe Lavagetto: hhvm: remove legacy diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/415828 [15:16:13] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4056683 (10jcrespo) Don't celebrate yet too hard, as it will increase the chances of the issue happening again :-D [15:19:19] PROBLEM - Disk space on kubernetes1001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/0c6af179-292d-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [15:19:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "https://grafana-admin.wikimedia.org/dashboard/db/hhvm-apc-usage at the very least uses the graphite data. We'll need to convert it to use " [puppet] - 10https://gerrit.wikimedia.org/r/415828 (owner: 10Giuseppe Lavagetto) [15:21:19] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:19] RECOVERY - Disk space on kubernetes1001 is OK: DISK OK [15:22:00] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:22:39] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [15:23:19] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/a2105b1b-292d-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [15:23:38] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4056717 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1011.eqiad.wmnet'] ``` and were **ALL** successful. [15:23:43] (03PS4) 10Mark Bergsma: [WiP] Split off attributes and exceptions from bgp.py into their own modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 [15:35:35] (03PS6) 10Herron: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - 10https://gerrit.wikimedia.org/r/413881 (https://phabricator.wikimedia.org/T187258) [15:37:52] (03CR) 10Herron: puppet_compiler: add support for puppetdb4 and local postgresql (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413881 (https://phabricator.wikimedia.org/T187258) (owner: 10Herron) [15:37:59] (03PS1) 10Jcrespo: Revert "cloud-dns: Point wikireplica-web to dbproxy1010" [puppet] - 10https://gerrit.wikimedia.org/r/420051 [15:38:35] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:38:55] (03PS1) 10Andrew Bogott: get_images: decompress compressed image downloads [wikitech-static] - 10https://gerrit.wikimedia.org/r/420052 (https://phabricator.wikimedia.org/T188926) [15:42:15] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - kubelet_operational_latencies is 66567 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:42:25] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [15:43:26] PROBLEM - Disk space on kubernetes1001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/6dbddaf5-2930-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [15:43:48] bblack: hi! yt? quick question... How feasible would it be to have trigger an EventLogging event server-side, from Varnish? [15:43:57] 10Operations, 10Discovery-Search: Additional network ports for elasticsearch servers? - https://phabricator.wikimedia.org/T189854#4056759 (10EBernhardson) Talked to a few people about this. While it could be possible and has been done, sparingly, in the past, it's typically more for redundancy than increase... [15:44:01] (03CR) 10Volans: "Thanks for the fixes. I'm ok with them. See inline for a more detailed reply to one of the points." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [15:44:04] to have something trigger [15:44:15] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - kubelet_operational_latencies is 7777 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:44:31] AndyRussG: not sure what you mean, do you mean have mediawiki send an eventlogging event? [15:44:44] or something inside prod network? rather than on client? 
[15:45:11] (03PS2) 10Jcrespo: Revert "cloud-dns: Point wikireplica-web to dbproxy1010" [puppet] - 10https://gerrit.wikimedia.org/r/420051 [15:45:25] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - kubelet_operational_latencies is 25355 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:45:45] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/b2a46a26-2930-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [15:46:25] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - kubelet_operational_latencies is 1786 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:47:25] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - kubelet_operational_latencies is 27399 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:48:04] ottomata: hi!!! can't do that because the page is varnish cached [15:48:09] yes to the second question [15:48:09] (03PS3) 10Bstorm: toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) [15:48:13] (03PS1) 10Jcrespo: cloud-dns: Point all wikireplicas to dbproxy1011 [puppet] - 10https://gerrit.wikimedia.org/r/420055 (https://phabricator.wikimedia.org/T183249) [15:48:25] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - kubelet_operational_latencies is 8834 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:48:40] ottomata: in fact I'm just circling back to something we've discussed before, for getting the donate wiki pageview info into our own kafka topic [15:48:46] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [15:48:49] (03CR) 10Bstorm: [C: 032] toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/419630 
(https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [15:48:52] (03CR) 10Muehlenhoff: [C: 031] "Ack, I can be around on Monday or alternatively also can take care of merging." [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [15:50:40] (03PS1) 10Rush: openstack: labvirt102[12] as Ubuntu and Liberty [puppet] - 10https://gerrit.wikimedia.org/r/420056 (https://phabricator.wikimedia.org/T187954) [15:50:41] AndyRussG: iiuc, the standard way to do that would be a kafka consumer that reads a topic and emits only the events it cares about to the second topic. [15:51:08] (03CR) 10Volans: "Minor nitpicks inline, looks good otherwise." (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/413397 (owner: 10Muehlenhoff) [15:51:25] RECOVERY - Disk space on kubernetes1001 is OK: DISK OK [15:51:32] (03CR) 10Jcrespo: "The drop user is scary- could some sanity checks be added to prevent accidental drop of privileged accounts (root and other admin accounts" [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [15:52:25] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/a847d13c-2931-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [15:54:16] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - kubelet_operational_latencies is 114004 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:54:25] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [15:55:28] (03CR) 10Bstorm: [C: 032] "> The drop user is scary- could some sanity checks be added to" [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [15:57:38] (03CR) 10Volans: "FYI I'm running a full puppet compiler run, I'll let you know the results" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) 
[15:58:27] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - kubelet_operational_latencies is 98236 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:58:32] (03PS2) 10Rush: openstack: labvirt102[12] as Ubuntu and Liberty [puppet] - 10https://gerrit.wikimedia.org/r/420056 (https://phabricator.wikimedia.org/T187954) [15:58:54] 10Operations: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928#4056796 (10MoritzMuehlenhoff) 05Open>03Resolved Closing [15:59:14] (03PS3) 10Rush: openstack: labvirt102[12] as Ubuntu and Liberty [puppet] - 10https://gerrit.wikimedia.org/r/420056 (https://phabricator.wikimedia.org/T187954) [16:01:27] PROBLEM - Disk space on kubernetes1001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/e36e1ff8-2932-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [16:01:50] (03PS1) 10Jcrespo: Switchover temporarily wikireplica-web to dbproxy1011 [dns] - 10https://gerrit.wikimedia.org/r/420058 (https://phabricator.wikimedia.org/T183249) [16:03:27] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - kubelet_operational_latencies is 1717 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:05:46] (03CR) 10RobH: [C: 031] openstack: labvirt102[12] as Ubuntu and Liberty [puppet] - 10https://gerrit.wikimedia.org/r/420056 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [16:06:04] (03CR) 10Rush: [C: 032] openstack: labvirt102[12] as Ubuntu and Liberty [puppet] - 10https://gerrit.wikimedia.org/r/420056 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [16:09:31] (03PS1) 10Elukey: aptrepo: add cassandra226 component [puppet] - 10https://gerrit.wikimedia.org/r/420059 [16:09:45] !log Stop MySQL on db1020 - T189773 [16:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:52] T189773: Decommission db1020 - 
https://phabricator.wikimedia.org/T189773 [16:12:07] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:12:18] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:12:57] (03CR) 10Gehel: [C: 031] "Damn... so many people involved in what look like such a minor change :) Thanks for all the help!" [puppet] - 10https://gerrit.wikimedia.org/r/419709 (owner: 10Jcrespo) [16:13:18] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 600998 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:14:18] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 5139 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:16:12] (03PS1) 10Rush: openstack: backports setup initial run [puppet] - 10https://gerrit.wikimedia.org/r/420060 (https://phabricator.wikimedia.org/T188266) [16:17:00] we are working on the proxies, sorry [16:17:17] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - kubelet_operational_latencies is 1060 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:17:21] well, on the passive to be decommed server [16:19:07] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 [16:19:17] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [16:19:47] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4056884 (10RobH) a:03Samwilson I went ahead and added the access checklist, and we need to review some of them: @samwilson needs to read and sign the L3 document, all users with shell... 
[16:20:15] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4056886 (10RobH) [16:21:36] (03CR) 10Jcrespo: [C: 032] Switchover temporarily wikireplica-web to dbproxy1011 [dns] - 10https://gerrit.wikimedia.org/r/420058 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [16:22:22] !log installing curl security updates [16:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:52] (03PS2) 10Jcrespo: cloud-dns: Point all wikireplicas to dbproxy1011 [puppet] - 10https://gerrit.wikimedia.org/r/420055 (https://phabricator.wikimedia.org/T183249) [16:23:18] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4056894 (10RobH) [16:23:26] (03CR) 10Jcrespo: [C: 032] cloud-dns: Point all wikireplicas to dbproxy1011 [puppet] - 10https://gerrit.wikimedia.org/r/420055 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [16:23:31] (03PS1) 10Marostegui: dbproxy100[2,7]: Change sby host [puppet] - 10https://gerrit.wikimedia.org/r/420061 (https://phabricator.wikimedia.org/T189773) [16:24:10] (03PS2) 10Marostegui: dbproxy100[2,7]: Change sby host [puppet] - 10https://gerrit.wikimedia.org/r/420061 (https://phabricator.wikimedia.org/T189773) [16:25:21] (03PS1) 10Herron: puppetdb_upgrade: point codfw puppet masters to puppetdb2001 [puppet] - 10https://gerrit.wikimedia.org/r/420062 (https://phabricator.wikimedia.org/T177253) [16:25:46] (03CR) 10Muehlenhoff: [C: 031] aptrepo: add cassandra226 component [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [16:26:18] (03CR) 10Herron: [C: 04-2] "not to be merged until beginning codfw puppetdb upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/420062 (https://phabricator.wikimedia.org/T177253) (owner: 10Herron) [16:26:25] (03PS3) 10Marostegui: dbproxy100[2,7]: Change standby host [puppet] - 
10https://gerrit.wikimedia.org/r/420061 (https://phabricator.wikimedia.org/T189773) [16:27:02] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4056918 (10RobH) [16:27:10] (03CR) 10Jcrespo: [C: 031] dbproxy100[2,7]: Change standby host [puppet] - 10https://gerrit.wikimedia.org/r/420061 (https://phabricator.wikimedia.org/T189773) (owner: 10Marostegui) [16:27:27] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4037542 (10RobH) [16:27:27] RECOVERY - Disk space on kubernetes1001 is OK: DISK OK [16:28:52] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/d077cd4f-2936-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [16:29:27] !log updating wikireplica_dns 2/3 [16:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:32] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - kubelet_operational_latencies is 34099 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:31:58] (03CR) 10Andrew Bogott: [V: 032 C: 032] get_images: decompress compressed image downloads [wikitech-static] - 10https://gerrit.wikimedia.org/r/420052 (https://phabricator.wikimedia.org/T188926) (owner: 10Andrew Bogott) [16:32:48] (03PS1) 10RobH: bmansurov's production and cloud keys match [puppet] - 10https://gerrit.wikimedia.org/r/420064 (https://phabricator.wikimedia.org/T189285) [16:33:32] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - kubelet_operational_latencies is 9284 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:33:44] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting 
access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4056941 (10RobH) a:05Vgutierrez>03bmansurov Please note in reviewing open access requests, I went ahead and checked, and the... [16:33:54] (03CR) 10RobH: [C: 032] bmansurov's production and cloud keys match [puppet] - 10https://gerrit.wikimedia.org/r/420064 (https://phabricator.wikimedia.org/T189285) (owner: 10RobH) [16:34:52] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [16:35:13] (03CR) 10Rush: [C: 031] "small note but seems sane, I didn't test this :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/408864 (https://phabricator.wikimedia.org/T171540) (owner: 10Madhuvishy) [16:35:45] (03CR) 10Rush: [C: 031] "Can you add a Bug: or something where this project was deleted?" [puppet] - 10https://gerrit.wikimedia.org/r/419970 (owner: 10Madhuvishy) [16:36:32] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/d1a159c2-2937-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [16:38:11] (03PS2) 10Madhuvishy: nfs: Remove config for deleted project wikidata-topicmaps [puppet] - 10https://gerrit.wikimedia.org/r/419970 [16:38:33] PROBLEM - Disk space on kubernetes1001 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/25744a82-2938-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [16:38:52] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/257427b5-2938-11e8-b60a-aa0000fe6bdf/volumes/kubernetes.iosecret/default-token-1ls38 is not accessible: Permission denied [16:39:50] 10Operations, 10Puppet, 10Patch-For-Review: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4056981 (10herron) Next week after the codfw puppet masters have been upgraded to stretch I plan to upgrade codfw to puppetdb 4 with this migration plan: # depool codfw puppetmasters (via... 
[16:40:08] (03CR) 10Madhuvishy: "There seems to be no phab task here, so adding the relevant sal entry" [puppet] - 10https://gerrit.wikimedia.org/r/419970 (owner: 10Madhuvishy) [16:40:17] (03PS3) 10Madhuvishy: nfs: Remove config for deleted project wikidata-topicmaps [puppet] - 10https://gerrit.wikimedia.org/r/419970 [16:40:58] (03CR) 10Madhuvishy: [C: 032] nfs: Remove config for deleted project wikidata-topicmaps [puppet] - 10https://gerrit.wikimedia.org/r/419970 (owner: 10Madhuvishy) [16:43:13] (03CR) 10Eevans: [C: 04-1] "Do we want to include the release version (6) in this? Would we ever want to support more than one 2.2 version?" [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [16:44:59] (03PS1) 10Elukey: Add labstore100[67] to statistics_servers to allow rsyncs [puppet] - 10https://gerrit.wikimedia.org/r/420066 (https://phabricator.wikimedia.org/T189644) [16:46:00] (03CR) 10Elukey: "> Do we want to include the release version (6) in this? Would we" [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [16:47:25] 10Operations, 10LDAP: add ssh key comparison to cross-validate-accounts.py - https://phabricator.wikimedia.org/T189890#4056998 (10RobH) p:05Triage>03Normal [16:51:43] (03CR) 10Madhuvishy: nfs-mount-manager: Add option to kill process accessing a mount (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/408864 (https://phabricator.wikimedia.org/T171540) (owner: 10Madhuvishy) [16:53:14] (03PS4) 10Madhuvishy: nfs-mount-manager: Add option to kill process accessing a mount [puppet] - 10https://gerrit.wikimedia.org/r/408864 (https://phabricator.wikimedia.org/T171540) [16:53:50] (03CR) 10Madhuvishy: [C: 032] nfs-mount-manager: Add option to kill process accessing a mount [puppet] - 10https://gerrit.wikimedia.org/r/408864 (https://phabricator.wikimedia.org/T171540) (owner: 10Madhuvishy) [16:57:17] (03PS1) 10Jcrespo: dbproxy-wikirreplicas: Revert to the original proxy configuration [dns] - 
10https://gerrit.wikimedia.org/r/420071 (https://phabricator.wikimedia.org/T183249) [16:58:12] (03PS3) 10Jcrespo: Revert "cloud-dns: Point wikireplica-web to dbproxy1010" [puppet] - 10https://gerrit.wikimedia.org/r/420051 [16:58:20] (03Abandoned) 10Jcrespo: Revert "cloud-dns: Point wikireplica-web to dbproxy1010" [puppet] - 10https://gerrit.wikimedia.org/r/420051 (owner: 10Jcrespo) [16:59:38] 10Operations, 10Puppet: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891#4057030 (10herron) p:05Triage>03Normal [16:59:51] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:36] 10Operations, 10Puppet: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891#4057030 (10herron) Puppet CA failover process for review # Disable puppet across the fleet # Ensure rsync (ca and volatile) destinations are up to date on puppetmaster2001 ## /var/lib/puppet/serv... [17:00:52] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:52] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:52] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:11] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:21] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:31] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:01:33] * godog looks at nitrogen sideways [17:01:42] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:42] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:01:52] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:11] yeah, no recent relevant deploy [17:02:12] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:12] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:41] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:41] (03PS1) 10Jcrespo: Revert "cloud-dns: Point wikireplica-web to dbproxy1010" [puppet] - 10https://gerrit.wikimedia.org/r/420073 (https://phabricator.wikimedia.org/T183249) [17:02:58] indeed it restarted itself a fe min ago [17:03:05] few* [17:03:27] good times [17:03:30] (03CR) 10Eevans: [C: 04-1] "> > Do we want to include the release version (6) in this? Would we" [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [17:03:32] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:32] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:55] is it memory issues or other things, you know? [17:04:01] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:04:21] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:31] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:32] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:30] (03PS10) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [17:05:35] 10Operations, 10Traffic: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057061 (10ema) [17:05:41] 10Operations, 10Traffic: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057076 (10ema) p:05Triage>03High [17:05:55] jynus: usually java OOM afaik [17:06:06] and old friend [17:06:09] *an [17:07:04] indeed, though I'm not seeing a heap dump so who knows [17:08:06] also after the puppetdb upgrade happens next week we will have support in puppet master for multiple puppetdb backend servers which in theory will avoid the issue of a single puppetdb instance restarting causing puppet runs in flight (or starting while it’s down) to fail [17:08:20] ha, that is cool [17:08:45] (03CR) 10Madhuvishy: "Yes I think we should leave this as is, I don't think there's any difference in computation if this script does the exclusion or cumin doe" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [17:09:01] 10Operations, 10Traffic: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057078 (10ema) [17:09:19] 10Operations, 10Traffic: varnish-be: rate of accepted sessions 
keeps on increasing - https://phabricator.wikimedia.org/T189892#4057061 (10ema) [17:09:25] Hi all, can I ask a question? [17:10:21] (03CR) 10Jcrespo: "For posterity, even if I said on IRC the same, everything that does not start with '^[spu][0-9]'" [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [17:10:41] Hi Ahmed123. Yes you can. [17:10:46] (03PS1) 10Jayprakash12345: Enable mapframe on knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420077 (https://phabricator.wikimedia.org/T189883) [17:10:49] (03PS2) 10Elukey: aptrepo: add cassandra22 component [puppet] - 10https://gerrit.wikimedia.org/r/420059 [17:11:37] (03CR) 10Elukey: "My bad, I was confused by cassandra 311, that is not 3.1.1 but 3.11.x, so I'd say that cassandra22 is more consistent :)" [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [17:11:39] (03CR) 10Eevans: [C: 031] aptrepo: add cassandra22 component [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [17:11:59] Thanks Dereckson, I uploaded a patch to enable the rollback group in arwikiquote and then I scheduled the patch on the Deployments page, so is there anything else needed from my side? Thanks [17:13:03] Ahmed123: normally you, or someone on your behalf, is required to be present at the time of the deploy and check the change worked as expected [17:13:06] (03PS1) 10Madhuvishy: analytics: Allow labstore1006|7 to rsync from stat* [puppet] - 10https://gerrit.wikimedia.org/r/420078 (https://phabricator.wikimedia.org/T188726) [17:13:15] Ahmed123: yes, two things: 1. Be here at this time 2. Be ready to test if your patch works fine, i.e. here, whether the group was added correctly. [17:13:25] (03PS2) 10Andrew Bogott: Rename role::mariadb::wikitech to role::mariadb::labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/419736 [17:13:28] As jynus said. 
[17:13:45] I was going to say, what Dereckson said, as that is more authoritative [17:13:51] :-) [17:14:07] ok thanks very much [17:14:17] (03CR) 10Elukey: [C: 031] "I just filed another identical code review, will abandon mine :)" [puppet] - 10https://gerrit.wikimedia.org/r/420078 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [17:14:24] (03Abandoned) 10Elukey: Add labstore100[67] to statistics_servers to allow rsyncs [puppet] - 10https://gerrit.wikimedia.org/r/420066 (https://phabricator.wikimedia.org/T189644) (owner: 10Elukey) [17:14:33] (03CR) 10Andrew Bogott: [C: 032] Rename role::mariadb::wikitech to role::mariadb::labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/419736 (owner: 10Andrew Bogott) [17:14:44] Ahmed123: if this is your first patch, you may want to check the mwdebug testing documentation [17:14:57] (03PS2) 10Madhuvishy: analytics: Allow labstore1006|7 to rsync from stat* [puppet] - 10https://gerrit.wikimedia.org/r/420078 (https://phabricator.wikimedia.org/T188726) [17:15:45] (03CR) 10Madhuvishy: [C: 032] analytics: Allow labstore1006|7 to rsync from stat* [puppet] - 10https://gerrit.wikimedia.org/r/420078 (https://phabricator.wikimedia.org/T188726) (owner: 10Madhuvishy) [17:16:04] Ahmed123: in a nutshell, the patch will first be deployed to a test server, so you can install an extension to your browser to redirect your requests to that server. [17:16:16] Thanks for your contribution by the way. [17:16:28] I was trying to search for the right documentation [17:17:29] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4039520 (10RobH) a:03DarTar This seems to be awaiting @dartar to confirm that the list (initially including just @Michele.tizzo... 
[17:17:48] (03PS2) 10Rush: wip: openstack: neutron component comments [puppet] - 10https://gerrit.wikimedia.org/r/420043 [17:18:30] I found it https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [17:18:44] ok, thanks all for the help i'll try what you said ^_^ [17:19:21] 10Operations, 10Traffic: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4057116 (10ema) [17:19:22] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 237630184 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:21:32] (03PS4) 10Andrew Bogott: Add Chicocvenancio's key for Cloud Services [labs/private] - 10https://gerrit.wikimedia.org/r/405376 (https://phabricator.wikimedia.org/T185273) (owner: 10Chico Venancio) [17:23:32] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4057125 (10DarTar) @RobH correct, it should apply to all four users, thanks for catching this. 
[17:23:42] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans, 10User-Elukey: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4057127 (10elukey) [17:24:22] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4349 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:00] (03PS1) 10Jcrespo: dbproxy: Update dbproxy1010 to the latest socket path and config [puppet] - 10https://gerrit.wikimedia.org/r/420079 (https://phabricator.wikimedia.org/T183249) [17:27:16] (03PS2) 10Jcrespo: dbproxy: Update dbproxy1010 to the latest socket path and config [puppet] - 10https://gerrit.wikimedia.org/r/420079 (https://phabricator.wikimedia.org/T183249) [17:27:55] (03CR) 10Jcrespo: [C: 032] dbproxy: Update dbproxy1010 to the latest socket path and config [puppet] - 10https://gerrit.wikimedia.org/r/420079 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [17:28:32] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:28:32] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:29:01] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:29:21] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:29:31] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:29:32] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:29:51] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:30:52] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 
failures [17:30:52] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:30:52] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:31:11] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:31:22] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 70824715 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:31:22] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:31:32] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:31:42] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:31:42] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:31:52] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:32:12] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:32:12] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:32:41] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:32:57] !log reimage dbproxy1010 [17:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:17] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057152 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` 
['dbproxy1010.eqiad.wmnet'] ``` The log can be found in `/var/... [17:34:25] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4057153 (10Anomie) >>! In T133410#4055687, @Tgr wrote: > ** On Parsoid (currently this means VE, Flow, the mobile apps and whatever third... [17:34:33] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4057154 (10bmansurov) @RobH thanks for the email with detailed instructions. Here's my new production key: ``` ssh-rsa AAAAB3NzaC... [17:36:22] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4240 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:36:26] (03PS1) 10RobH: restoring bmansurov shell access [puppet] - 10https://gerrit.wikimedia.org/r/420080 (https://phabricator.wikimedia.org/T189285) [17:37:44] (03CR) 10RobH: [C: 032] restoring bmansurov shell access [puppet] - 10https://gerrit.wikimedia.org/r/420080 (https://phabricator.wikimedia.org/T189285) (owner: 10RobH) [17:39:27] (03CR) 10Muehlenhoff: [C: 031] "Ok :-)" [puppet] - 10https://gerrit.wikimedia.org/r/420059 (owner: 10Elukey) [17:39:30] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4057165 (10RobH) I've restored your shell access, however we'll still need to work on the access request expansion requested on t... 
[17:40:38] (03PS2) 10Madhuvishy: WIP: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T171540) [17:41:22] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 47295571 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:52:48] (03PS1) 10Dzahn: introduce vega.codfw.wmnet (bromine equivalent) [dns] - 10https://gerrit.wikimedia.org/r/420082 (https://phabricator.wikimedia.org/T188163) [17:55:14] (03PS1) 10Madhuvishy: [WIP] statistics: Migrate dumps mount to labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644) [17:56:45] (03PS3) 10Rush: openstack: neutron component annotations [puppet] - 10https://gerrit.wikimedia.org/r/420043 [17:58:06] (03PS4) 10Rush: openstack: neutron component annotations [puppet] - 10https://gerrit.wikimedia.org/r/420043 [17:58:46] (03CR) 10Rush: [C: 032] openstack: neutron component annotations [puppet] - 10https://gerrit.wikimedia.org/r/420043 (owner: 10Rush) [17:58:51] (03PS2) 10Madhuvishy: [WIP] statistics: Migrate dumps mount to labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644) [18:02:27] (03PS1) 10Rush: openstack: labtestn pass in dhcp_domain for dhcp-agent [puppet] - 10https://gerrit.wikimedia.org/r/420084 (https://phabricator.wikimedia.org/T187954) [18:03:31] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4389 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:04:06] (03CR) 10Rush: "labtestcontrol2003.wikimedia.org,labtestneutron2001.codfw.wmnet,labtestvirt2003.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/420084 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [18:06:12] (03CR) 10Jcrespo: [C: 032] dbproxy-wikirreplicas: Revert to the original proxy configuration [dns] - 10https://gerrit.wikimedia.org/r/420071 
(https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [18:07:25] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057260 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1010.eqiad.wmnet'] ``` and were **ALL** successful. [18:08:59] (03PS2) 10Jcrespo: Revert "cloud-dns: Point wikireplica-web to dbproxy1010" [puppet] - 10https://gerrit.wikimedia.org/r/420073 (https://phabricator.wikimedia.org/T183249) [18:09:30] (03CR) 10Jcrespo: [C: 032] Revert "cloud-dns: Point wikireplica-web to dbproxy1010" [puppet] - 10https://gerrit.wikimedia.org/r/420073 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [18:10:03] (03PS3) 10Madhuvishy: [WIP] statistics: Migrate dumps mount to labstore1006|7 [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644) [18:10:38] (03PS1) 10Rush: openstack: neutron-server ferm rule for default port [puppet] - 10https://gerrit.wikimedia.org/r/420085 (https://phabricator.wikimedia.org/T187954) [18:12:04] (03PS2) 10Rush: openstack: labtestn pass in dhcp_domain for dhcp-agent [puppet] - 10https://gerrit.wikimedia.org/r/420084 (https://phabricator.wikimedia.org/T187954) [18:12:07] (03PS2) 10Rush: openstack: neutron-server ferm rule for default port [puppet] - 10https://gerrit.wikimedia.org/r/420085 (https://phabricator.wikimedia.org/T187954) [18:13:05] !log switching back wikireplica cloud dns to the original config [18:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:20] 10Operations: VM request for bromine (misc_static_sites) equivalent - https://phabricator.wikimedia.org/T189899#4057273 (10Dzahn) [18:14:38] 10Operations, 10vm-requests: VM request for bromine (misc_static_sites) equivalent - https://phabricator.wikimedia.org/T189899#4057288 (10Dzahn) [18:15:12] (03PS2) 10Dzahn: introduce vega.codfw.wmnet (bromine equivalent) [dns] - 
10https://gerrit.wikimedia.org/r/420082 (https://phabricator.wikimedia.org/T188163) [18:15:18] (03PS3) 10Dzahn: introduce vega.codfw.wmnet (bromine equivalent) [dns] - 10https://gerrit.wikimedia.org/r/420082 (https://phabricator.wikimedia.org/T188163) [18:15:27] (03CR) 10Rush: [C: 032] openstack: labtestn pass in dhcp_domain for dhcp-agent [puppet] - 10https://gerrit.wikimedia.org/r/420084 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [18:16:08] (03CR) 10Rush: [C: 032] openstack: neutron-server ferm rule for default port [puppet] - 10https://gerrit.wikimedia.org/r/420085 (https://phabricator.wikimedia.org/T187954) (owner: 10Rush) [18:17:18] (03PS2) 10Rush: openstack: glance bootstrapping with debian image [puppet] - 10https://gerrit.wikimedia.org/r/419871 (https://phabricator.wikimedia.org/T188266) [18:18:14] (03CR) 10Rush: [C: 032] openstack: glance bootstrapping with debian image [puppet] - 10https://gerrit.wikimedia.org/r/419871 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [18:19:08] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4057309 (10Tgr) That depends. Try something like this: ```

10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4057377 (10Marostegui) Very nice work! :-) [18:50:22] (03PS3) 10Andrew Bogott: californium: mark as spare system [puppet] - 10https://gerrit.wikimedia.org/r/419534 (https://phabricator.wikimedia.org/T168470) [18:51:19] (03CR) 10Andrew Bogott: [C: 032] californium: mark as spare system [puppet] - 10https://gerrit.wikimedia.org/r/419534 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [19:17:32] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 98300711 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:19:33] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 5718 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:21:42] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 20815906 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:25:42] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5970 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:28:04] (03PS1) 10Bstorm: toolsdb: include failsafe against removing admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/420114 (https://phabricator.wikimedia.org/T188680) [19:37:42] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 85467581 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:39:16] (03CR) 10BryanDavis: [C: 031] toolsdb: include failsafe against removing admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/420114 (https://phabricator.wikimedia.org/T188680) (owner: 10Bstorm) [19:39:54] (03PS5) 10Mark Bergsma: Split off attributes and exceptions from bgp.py into their own modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 [19:42:32] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - 
apiserver_request_latencies is 390518965 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:43:37] (03CR) 10Mark Bergsma: [C: 031] Split off attributes and exceptions from bgp.py into their own modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/416985 (owner: 10Mark Bergsma) [19:49:42] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 6975 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:54:33] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4496 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:54:42] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 123613210 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:57:45] (03PS1) 10Gergő Tisza: Allow protocol-relative URLs in TemplateStyles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) [20:01:42] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 6225 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:03:33] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 24494588 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:06:42] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 21370894 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:07:42] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5549 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:11:32] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 4321 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:15:14] (03CR) 10Anomie: "This change looks like it will do what it intends to do. Haven't tested." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/420115 (https://phabricator.wikimedia.org/T188760) (owner: 10Gergő Tisza) [20:18:41] (03PS1) 10Ottomata: Exclude change-prop topics from main -> jumbo MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/420117 (https://phabricator.wikimedia.org/T189464) [20:22:08] (03CR) 10Ottomata: [C: 032] Exclude change-prop topics from main -> jumbo MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/420117 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [20:30:43] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. [20:31:32] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 48246944 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:31:43] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 85340768 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:32:32] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 5020 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:33:43] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 5736 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:38:42] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 83125522 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:42:22] (03PS1) 10Mark Bergsma: Fix Attribute.__eq__ and .__ne__ [debs/pybal] - 10https://gerrit.wikimedia.org/r/420119 [20:42:24] (03PS1) 10Mark Bergsma: Fix MPReachNLRIAttribute AFI_INET construction from tuple [debs/pybal] - 10https://gerrit.wikimedia.org/r/420120 [20:43:42] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 6201 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:02:04] (03PS1) 10Rush: openstack: labtestn values for ml2 and linuxbridge setup [puppet] - 
10https://gerrit.wikimedia.org/r/420121 (https://phabricator.wikimedia.org/T188266) [21:03:03] (03CR) 10Rush: [C: 032] openstack: labtestn values for ml2 and linuxbridge setup [puppet] - 10https://gerrit.wikimedia.org/r/420121 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:03:11] (03PS2) 10Rush: openstack: labtestn values for ml2 and linuxbridge setup [puppet] - 10https://gerrit.wikimedia.org/r/420121 (https://phabricator.wikimedia.org/T188266) [21:05:53] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420122 (https://phabricator.wikimedia.org/T189778) [21:12:45] (03PS1) 10Rush: openstack: pass down network_flat_name variable [puppet] - 10https://gerrit.wikimedia.org/r/420124 (https://phabricator.wikimedia.org/T188266) [21:13:34] (03CR) 10Rush: [C: 032] openstack: pass down network_flat_name variable [puppet] - 10https://gerrit.wikimedia.org/r/420124 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:39:42] (03CR) 10Catrope: "It's simpler to get rid of the computed dblist and just have a single list of Flow wikis. We don't actually need the public/private separa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [22:12:29] 10Operations, 10Gerrit, 10Release-Engineering-Team (Someday): Setup reply emails for gerrit - https://phabricator.wikimedia.org/T158915#4057775 (10Paladox) [22:15:42] (03CR) 10Nemo bis: "Good. I thought you'd prefer to do that in a separate patch but all the better. 
Having a flow.dblist rather than InitialiseSettings indivi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [22:16:33] (03PS2) 10Bstorm: toolsdb: include failsafe against removing admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/420114 (https://phabricator.wikimedia.org/T188680) [22:17:26] (03PS3) 10Bstorm: toolsdb: include failsafe against removing admin accounts [puppet] - 10https://gerrit.wikimedia.org/r/420114 (https://phabricator.wikimedia.org/T188680) [22:27:01] (03PS4) 10Catrope: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [22:28:22] (03CR) 10jerkins-bot: [V: 04-1] Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [22:29:58] (03CR) 10Catrope: "Yes, flow.dblist (and flow-labs.dblist, which is computed and not used by the config itself) are there so that we can use foreachwikiindbl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [22:30:59] (03PS5) 10Catrope: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [22:35:29] (03PS6) 10Catrope: Switch public wikis to explicit Flow usage definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: 10Nemo bis) [22:49:53] (03PS1) 10Catrope: Enable $wgFlowReadOnly on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420131 (https://phabricator.wikimedia.org/T186463) [22:59:29] 10Operations, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc 
varnish - https://phabricator.wikimedia.org/T188163#4057815 (10Dzahn) [22:59:54] 10Operations, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#3998458 (10Dzahn) [23:00:41] 10Operations: upgrade bromine to stretch / reinstall - https://phabricator.wikimedia.org/T189910#4057820 (10Dzahn) [23:01:24] 10Operations, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#3998458 (10Dzahn) p:05Triage>03Normal [23:03:50] 10Operations, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#4057836 (10Dzahn) [23:03:54] 10Operations, 10vm-requests, 10Patch-For-Review: VM request for bromine (misc_static_sites) equivalent - https://phabricator.wikimedia.org/T189899#4057834 (10Dzahn) 05Open>03Resolved - Instance name: vega.codfw.wmnet Creation time: 2018-03-16 18:42:34 Nodes: - primary: ganeti2001.codfw.wmnet... [23:04:55] 10Operations: upgrade bromine to stretch / reinstall - https://phabricator.wikimedia.org/T189910#4057837 (10Dzahn) also: disk space is running low and it will grow over time. maybe add another disk or replace it with a new VM with 30G instead of only 20G (and that will match vega.codfw.wmnet) [23:05:52] 10Operations, 10HHVM, 10User-Elukey: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4057840 (10Joe) See https://phabricator.wikimedia.org/T86096#2329554 and https://phabricator.wikimedia.org/T86096#2326032 as methods to evaluate run times. We should also te... 
[23:08:41] (03PS1) 10Dzahn: DHCP/netboot: add vega.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420132 (https://phabricator.wikimedia.org/T188163) [23:10:12] (03CR) 10Dzahn: [C: 032] DHCP/netboot: add vega.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420132 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn) [23:32:14] !log signing puppet cert for vega.codfw.wmnet, initial puppet run after fresh stretch install (T188163) [23:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:21] T188163: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163 [23:36:15] (03PS1) 10Dzahn: site/webserver_misc_static: add vega as codfw node [puppet] - 10https://gerrit.wikimedia.org/r/420134 (https://phabricator.wikimedia.org/T188163) [23:39:31] (03CR) 10Dzahn: [C: 032] site/webserver_misc_static: add vega as codfw node [puppet] - 10https://gerrit.wikimedia.org/r/420134 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn) [23:49:12] 10Operations, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#4057865 (10Dzahn) [23:49:50] 10Operations, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#3998458 (10Dzahn)