[00:05:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 43 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[00:08:23] !log krinkle@deploy1001 Synchronized wmf-config/: I79fb3d194a: add env.php file (not yet used) (duration: 00m 50s)
[00:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:26] RECOVERY - Device not healthy -SMART- on dbstore1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbstore1002&var-datasource=eqiad%2520prometheus%252Fops
[00:45:22] !log krinkle@deploy1001 Synchronized multiversion/MWRealm.php: I79fb3d194a58: use env.php (duration: 00m 49s)
[00:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:06] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:48:26] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 43 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[02:28:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[02:36:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 40 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:14:02] !log kartik@deploy1001 Started deploy [cxserver/deploy@5a70ef1]: Update cxserver to 47a864b (T205420, T203077, T205700, T205616)
[03:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:14:10] T205420: CXServer fails with HTTP 500: "MT processing error: undefined" - https://phabricator.wikimedia.org/T205420
[03:14:11] T205700: Apertium fails to translate a quote template - https://phabricator.wikimedia.org/T205700
[03:14:11] T205616: CX2: Section alignment broken when translating "Fleming: The Man Who Would Be Bond" from Hebrew - https://phabricator.wikimedia.org/T205616
[03:14:12] T203077: Performance analysis for translate API - https://phabricator.wikimedia.org/T203077
[03:18:46] !log kartik@deploy1001 Finished deploy [cxserver/deploy@5a70ef1]: Update cxserver to 47a864b (T205420, T203077, T205700, T205616) (duration: 04m 44s)
[03:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 806.54 seconds
[04:17:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[04:24:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[04:34:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 265.19 seconds
[04:49:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[04:57:16] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 57 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:02:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:08:08] Operations, ops-eqiad, Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (Marostegui)
[05:10:43] (PS1) Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467249 (https://phabricator.wikimedia.org/T206743)
[05:11:23] !log Stop MySQL on db1116:3318 to use it to clone db1109
[05:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:15] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467249 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[05:13:10] marostegui: I'm updating cxserver, is that OK to do (saw DB related stuffs going on now)?
[05:13:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467249 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[05:13:31] kart_: yep, feel free to go
[05:14:25] Thanks!
[05:14:36] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:14:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 (duration: 00m 50s)
[05:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:12] !log Stop MySQL on db1109 for recloning - T206743
[05:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:15] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[05:16:28] !log kartik@deploy1001 Started deploy [cxserver/deploy@fd74c3b]: Update cxserver to b51f363 (T203077, T99934, T203550)
[05:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:35] T99934: CX2: ContentTranslation adapts some references only partially - https://phabricator.wikimedia.org/T99934
[05:16:36] T203550: CX2: Reference template adapted as empty despite templateData - https://phabricator.wikimedia.org/T203550
[05:16:36] T203077: Performance analysis for translate API - https://phabricator.wikimedia.org/T203077
[05:18:17] (CR) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467249 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[05:20:08] (PS6) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikinews.org [puppet] - https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968)
[05:20:53] !log kartik@deploy1001 Finished deploy [cxserver/deploy@fd74c3b]: Update cxserver to b51f363 (T203077, T99934, T203550) (duration: 04m 25s)
[05:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:21] Operations, DBA, MediaWiki-General-or-Unknown, Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (Marostegui) >>! In T205865#4657924, @hoo wrote: > ```wikiadmin@db1109(wikidatawiki)> SELECT * FROM in...
[05:33:26] (CR) Marostegui: [C: -1] "Missing the IP entry on db-eqiad.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[05:34:06] (CR) Marostegui: [C: -1] mariadb: reimage db2096 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[05:34:34] (CR) Marostegui: [C: -1] mariadb: productionize db2096 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[05:35:36] (CR) Marostegui: [C: -1] mariadb: productionize db2096 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[05:39:00] Operations, DBA, MediaWiki-Database, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) @Banyek please double check the key purge has finished on mwaint1002 and keep on with the rest of pending things to do here. Probably...
[05:40:27] PROBLEM - SSH cp5004.mgmt on cp5004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:04:49] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12894/mwdebug1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:06:42] Operations, ops-eqiad, netops: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (ayounsi) p:Triage>High
[06:07:18] good morning marostegui! (re: dbstore1002's task :P)
[06:08:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:10:16] elukey: happy monday XD
[06:10:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:10:17] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:12:33] (CR) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikisource.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:17:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 43 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:18:05] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12895/mwdebug1001.eqiad.wmnet/ LGTM but I'd prefer someone else to check too." [puppet] - https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:20:12] (CR) Muehlenhoff: [C: 1] "Looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:22:22] Operations, Core Platform Team Backlog (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Legoktm) >>! In T176370#4650841, @Smalyshev wrote: >> do you have an idea on how different 7.3 is from 7.2 > > Shoul...
[06:25:01] (PS1) Marostegui: db-eqiad.php: Slowly repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467252 (https://phabricator.wikimedia.org/T206743)
[06:25:42] Operations, Core Platform Team Backlog (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Smalyshev) > I know it's currently in a RC state, what are the expected changes until the actual release? No change...
[06:25:54] (PS4) Muehlenhoff: mediawiki::web::prod_sites: convert wikibooks.org [puppet] - https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:26:20] (CR) jerkins-bot: [V: -1] mediawiki::web::prod_sites: convert wikibooks.org [puppet] - https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[06:29:18] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:30:36] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/interface-rps]
[06:31:28] (PS3) Elukey: Refactor type Systemd::Timer::DateTime to include more normal forms [puppet] - https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532)
[06:31:36] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:32:09] (CR) Elukey: [C: 2] "Thanks for the review! Sure I'll try to add some rspec tests, if I fail miserably I'll open a task :)" [puppet] - https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[06:32:46] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/bash/puppet-common.sh]
[06:35:55] Operations, DBA, MediaWiki-General-or-Unknown, Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (Marostegui) @hoo after db1109 has been recloned (and now has compressed tables): ``` root@db1109.eqia...
[06:37:17] (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:39:07] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467252 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[06:39:41] Operations, Traffic: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (ayounsi) Confirmed that all the network devices are back to a healthy state. And we received a completion notice, should be safe to repool the site. >>! In T206861#4664498, @faidon wrote: > - How come the bottom ha...
[06:39:42] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467252 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[06:40:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1109 (duration: 00m 50s)
[06:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:56] (CR) Muehlenhoff: [C: -1] "Thumbor still uses the memcached collector" [puppet] - https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:45:21] (PS1) Marostegui: db-eqiad.php: Increase weight for db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467257
[06:46:54] (CR) Muehlenhoff: [C: -1] "This collector seems to be in active use on the "Phabricator" dashboard and doesn't seem to have a replacement yet?" [puppet] - https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:47:20] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467257 (owner: Marostegui)
[06:47:42] (CR) Muehlenhoff: [C: -1] "See comment at 466988" [puppet] - https://gerrit.wikimedia.org/r/466989 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:47:57] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467252 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[06:48:24] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467257 (owner: Marostegui)
[06:48:37] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467257 (owner: Marostegui)
[06:49:19] (CR) Ayounsi: [C: 2] Add v6 OOB IP for mr1-ulsfo [dns] - https://gerrit.wikimedia.org/r/466783 (https://phabricator.wikimedia.org/T206778) (owner: Ayounsi)
[06:49:21] (CR) ArielGlenn: [C: 1] "That covers all the dump roles on the snapshot hosts, but not the dumpsdata or labstore web servers. Did you want those too?" [puppet] - https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:49:25] (CR) Ayounsi: [C: 2] Icinga, add mr1-ulsfo IPv6 OOB [puppet] - https://gerrit.wikimedia.org/r/466787 (https://phabricator.wikimedia.org/T206778) (owner: Ayounsi)
[06:49:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1109 (duration: 00m 49s)
[06:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:49] (PS2) Ayounsi: Add v6 OOB IP for mr1-ulsfo [dns] - https://gerrit.wikimedia.org/r/466783 (https://phabricator.wikimedia.org/T206778)
[06:50:33] (CR) Muehlenhoff: [C: 1] "Looks good, but before merging make sure all the invidual service dashboards have been updated." [puppet] - https://gerrit.wikimedia.org/r/466906 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:51:59] Operations, DBA, MediaWiki-General-or-Unknown, Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (Addshore) As some of the pooled hosts has already been reimaged we can already see the production dis...
[06:52:08] (PS2) Ayounsi: Icinga, add mr1-ulsfo IPv6 OOB [puppet] - https://gerrit.wikimedia.org/r/466787 (https://phabricator.wikimedia.org/T206778)
[06:52:33] (PS1) Marostegui: db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467260 (https://phabricator.wikimedia.org/T206743)
[06:52:53] (CR) Muehlenhoff: [C: -1] "This still uses the memcached collector" [puppet] - https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:54:22] (CR) Muehlenhoff: [C: -1] hiera: remove diamond from wmcs role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:54:33] (CR) Ayounsi: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12898/" [puppet] - https://gerrit.wikimedia.org/r/466787 (https://phabricator.wikimedia.org/T206778) (owner: Ayounsi)
[06:55:40] !log add v6 monitoring for mr1-ulsfo OOB - T206778
[06:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:44] T206778: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778
[06:55:56] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:11] Operations, netops, Patch-For-Review: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (ayounsi) Open>Resolved All set.
[06:56:16] (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/466903 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:56:28] Operations, netops, Patch-For-Review: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (ayounsi) a:RobH>ayounsi
[06:56:56] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:38] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:28] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467260 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[07:03:32] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467260 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[07:03:48] (CR) jenkins-bot: db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/467260 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[07:04:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1109 (duration: 00m 49s)
[07:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:18] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/467261 (https://phabricator.wikimedia.org/T206743)
[07:09:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/467261 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[07:11:01] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/467261 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[07:12:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 - T206743 (duration: 00m 49s)
[07:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:19] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[07:12:37] (PS1) Elukey: Remove starting ' ' from Calendar date/tim in analytics' systemd timers [puppet] - https://gerrit.wikimedia.org/r/467262 (https://phabricator.wikimedia.org/T172532)
[07:12:39] (PS1) Elukey: profile::analytics::refinery::job::data_purge: add systemd timer [puppet] - https://gerrit.wikimedia.org/r/467263 (https://phabricator.wikimedia.org/T172532)
[07:13:32] (CR) Elukey: [C: 2] Remove starting ' ' from Calendar date/tim in analytics' systemd timers [puppet] - https://gerrit.wikimedia.org/r/467262 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[07:14:20] (CR) Muehlenhoff: [C: -1] "Actually, there's an additional detail I missed in my initial review: I forgot about our remaining ~20ish Ubuntu systems: On jessie/stretc" [puppet] - https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[07:15:30] Operations, DBA, MediaWiki-General-or-Unknown, Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (Marostegui) Another test with db1104 before and after compressing: ``` root@db1104.eqiad.wmnet[wikida...
[07:15:54] !log Stop MySQL at db1116:3318 to clone db1104
[07:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:39] (CR) Elukey: [C: 2] profile::analytics::refinery::job::data_purge: add systemd timer [puppet] - https://gerrit.wikimedia.org/r/467263 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[07:18:33] (PS1) Muehlenhoff: Remove Diamond from DB roles [puppet] - https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454)
[07:18:54] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/467261 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[07:23:04] (PS2) Banyek: mariadb: reimage db2096 [puppet] - https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593)
[07:24:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 - T206743 (duration: 00m 48s)
[07:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:11] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[07:24:54] (CR) Marostegui: [C: 1] mariadb: reimage db2096 [puppet] - https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[07:25:50] (PS3) Banyek: mariadb: productionize db2096 [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593)
[07:27:32] (PS5) Banyek: mariadb: productionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[07:28:24] (CR) Marostegui: [C: 1] mariadb: productionize db2096 [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[07:31:50] !log reimaging db2096
[07:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:12] !log reimaging db2096(T206593)
[07:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:15] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593
[07:34:29] (CR) Banyek: [C: 2] mariadb: reimage db2096 [puppet] - https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[07:36:15] (PS3) Banyek: mariadb: reimage db2096 [puppet] - https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593)
[07:36:39] (CR) Banyek: [V: 2 C: 2] mariadb: reimage db2096 [puppet] - https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[07:37:35] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:39:36] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:44:45] (PS1) Addshore: Turn off wikidata dispatch verbose mode [puppet] - https://gerrit.wikimedia.org/r/467282
[07:57:33] !log reformat ms-be2040 with crc=1 finobt=0 - T199198
[07:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:36] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[08:00:33] Operations, DBA, MediaWiki-Database, Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) on checking the screen '91243.parsercache' on mwmaint1002 I can confirm that the key purge - I proceed
[08:01:33] Operations, DBA, MediaWiki-Database, Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) a:Banyek
[08:01:46] Operations, DBA, MediaWiki-Database, Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek)
[08:02:12] Operations, DBA, MediaWiki-Database, Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Marostegui) Sounds good! Also as T206740#4659202, let's create a separate task for the replication check addition, so we can just focus on the imm...
[08:08:01] Operations, DBA, MediaWiki-Database, Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek)
[08:08:12] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:08:12] PROBLEM - swift-container-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:08:22] PROBLEM - swift-container-updater on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:08:42] PROBLEM - swift-object-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:08:43] PROBLEM - SSH on ms-be2040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:52] PROBLEM - swift-account-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:08:53] PROBLEM - swift-container-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:09:12] PROBLEM - Disk space on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:09:52] PROBLEM - MD RAID on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:10:03] PROBLEM - dhclient process on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:10:22] PROBLEM - swift-container-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:10:52] PROBLEM - Check size of conntrack table on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:10:58] <_joe_> uhm that doesn't look healthy
[08:11:02] PROBLEM - swift-account-replicator on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:11:02] PROBLEM - swift-container-auditor on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:11:03] RECOVERY - Disk space on ms-be2040 is OK: DISK OK
[08:11:03] RECOVERY - dhclient process on ms-be2040 is OK: PROCS OK: 0 processes with command name dhclient
[08:11:12] RECOVERY - swift-container-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[08:11:23] RECOVERY - swift-container-updater on ms-be2040 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[08:11:28] <_joe_> I can't seem to ssh into that server, anyone has any luck?
[08:11:33] RECOVERY - swift-object-auditor on ms-be2040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[08:11:43] RECOVERY - Check size of conntrack table on ms-be2040 is OK: OK: nf_conntrack is 0 % full
[08:11:52] RECOVERY - swift-account-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[08:11:53] RECOVERY - swift-container-auditor on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[08:11:54] <_joe_> oh ahahah it's godog working on it :P
[08:12:14] <_joe_> i must highlight the lines starting with !log
[08:12:23] PROBLEM - very high load average likely xfs on ms-be2040 is CRITICAL: CRITICAL - load average: 162.66, 106.80, 51.98
[08:13:22] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:13:24] Operations, monitoring, Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (fgiunchedi) I was looking at {T206704} and likely einsteinium/tegmen addresses will be found in other places on router configuration too (including pfw like...
[08:13:42] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:13:42] RECOVERY - SSH on ms-be2040 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[08:13:52] PROBLEM - MD RAID on ms-be2040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:14:13] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2040 is OK: OK ferm input default policy is set
[08:14:23] PROBLEM - very high load average likely xfs on ms-be2040 is CRITICAL: CRITICAL - load average: 128.53, 117.11, 62.95
[08:14:42] RECOVERY - MD RAID on ms-be2040 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[08:16:36] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::prod_sites: convert wikisource.org [puppet] - https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[08:16:50] (PS5) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikisource.org [puppet] - https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968)
[08:17:08] !log installing imagemagick security update
[08:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:17] (PS1) Marostegui: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/467285 (https://phabricator.wikimedia.org/T206743)
[08:17:32] RECOVERY - very high load average likely xfs on ms-be2040 is OK: OK - load average: 23.30, 72.58, 55.19
[08:17:35] (CR) Marostegui: [C: -1] "Server still recloning" [mediawiki-config] - https://gerrit.wikimedia.org/r/467285 (https://phabricator.wikimedia.org/T206743) (owner: Marostegui)
[08:18:37] ah yes, sorry for the noise
[08:19:45] Operations, Elasticsearch, Icinga, Discovery-Search (Current work), Patch-For-Review: reconfigure Icinga alert for elasticsearch_shard_size to reduce false positive alerts - https://phabricator.wikimedia.org/T206187 (Gehel) I think the proposal make sense. This check is here so that we don't...
[08:20:43] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:25:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:26:31] (03PS1) 10Elukey: Raise the Yarn Hadoop Resource Manager heap min/max to 4G [puppet] - 10https://gerrit.wikimedia.org/r/467286 (https://phabricator.wikimedia.org/T206943) [08:27:05] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) [08:28:43] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [08:30:45] (03PS6) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) [08:31:40] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12901/" [puppet] - 10https://gerrit.wikimedia.org/r/467286 (https://phabricator.wikimedia.org/T206943) (owner: 10Elukey) [08:32:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:34:20] (03CR) 10Banyek: [C: 032] mariadb: productionize db2096 [puppet] - 10https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [08:34:30] (03PS4) 10Banyek: mariadb: productionize db2096 [puppet] - 
10https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) [08:34:33] (03CR) 10Banyek: [V: 032 C: 032] mariadb: productionize db2096 [puppet] - 10https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [08:34:44] (03CR) 10Elukey: [C: 032] Raise the Yarn Hadoop Resource Manager heap min/max to 4G [puppet] - 10https://gerrit.wikimedia.org/r/467286 (https://phabricator.wikimedia.org/T206943) (owner: 10Elukey) [08:34:51] (03PS2) 10Elukey: Raise the Yarn Hadoop Resource Manager heap min/max to 4G [puppet] - 10https://gerrit.wikimedia.org/r/467286 (https://phabricator.wikimedia.org/T206943) [08:35:56] banyek: o/ - ok to merge? [08:36:03] (I am on puppetmaster1001) [08:36:04] yes please [08:36:06] super [08:36:06] :) [08:37:53] I downtimed this RIPE atlas alert for 2 days, and will email the RIPE [08:38:29] XioNoX: ack [08:38:29] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10fgiunchedi) I did a quick audit in eqiad (for starters) to preview how we'd be affected by the alert, in this way: * Tunnel to... 
[08:42:33] (03CR) 10Filippo Giunchedi: [C: 04-1] "Left comments on related task (easier to track IMO)" [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [08:44:01] (03PS1) 10Banyek: mariadb: depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467288 (https://phabricator.wikimedia.org/T206593) [08:44:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467285 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [08:44:42] (03CR) 10Marostegui: [C: 031] mariadb: depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467288 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [08:44:58] banyek: ^ I am deploying, so let me finish my change first as I have already +2ed mine [08:45:10] banyek: Will give you the green light once I am done [08:45:24] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12903/mwdebug1001.eqiad.wmnet/ LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [08:45:32] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467285 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [08:46:00] (03PS1) 10Ayounsi: Repool eqsin after power maintenance [dns] - 10https://gerrit.wikimedia.org/r/467289 (https://phabricator.wikimedia.org/T206861) [08:46:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 - T206743 (duration: 00m 49s) [08:46:39] banyek: I am done, you can go whenever you want [08:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:41] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [08:47:44] 
tx [08:47:52] can I have a +1 too? :) [08:47:58] (03CR) 10Ayounsi: [C: 032] Repool eqsin after power maintenance [dns] - 10https://gerrit.wikimedia.org/r/467289 (https://phabricator.wikimedia.org/T206861) (owner: 10Ayounsi) [08:48:35] ah, I got it, sorry [08:48:54] !log depooling db2033 (T206593) [08:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:58] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 [08:49:40] !log repool eqsin - T206861 [08:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:43] T206861: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 [08:50:04] !log restart hadoop yarn resource managers on an-master* to pick up new jvm settings [08:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:52] (03CR) 10Banyek: [C: 032] mariadb: depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467288 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [08:53:15] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467285 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui) [08:53:17] (03CR) 10jenkins-bot: mariadb: depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467288 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [08:57:03] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467290 [08:58:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467290 (owner: 10Marostegui) [08:58:46] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 49.45 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:58:50] !log banyek@deploy1001 Synchronized 
wmf-config/db-codfw.php: T206593: depooling db2069 (duration: 00m 48s) [08:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:56] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 [08:59:22] banyek: you logged that you depooled db2033, is that intended? [08:59:25] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467290 (owner: 10Marostegui) [09:00:14] No, I messed up the message [09:00:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1104 (duration: 00m 48s) [09:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:48] I depooled db2069 but earlier I wanted to depool 2033 instead and I wrote that :/ [09:08:25] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467290 (owner: 10Marostegui) [09:08:52] 10Operations, 10netops, 10Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (10ayounsi) 05Resolved>03Open The IPv6 pings eqiad alert keeps flapping, I downtimed it for 2 days and emailed the RIPE. 
[09:12:39] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:14:17] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:14:26] (03PS7) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) [09:16:45] (03PS1) 10Marostegui: db-eqiad.php: Restore weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467293 [09:18:04] (03PS3) 10Muehlenhoff: mediawiki::web::prod_sites: convert vote.w.o [puppet] - 10https://gerrit.wikimedia.org/r/462492 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:18:56] (03PS2) 10Marostegui: db-eqiad.php: Restore weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467293 [09:18:58] (03CR) 10Muehlenhoff: [C: 031] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/466879 (https://phabricator.wikimedia.org/T206844) (owner: 10Volans) [09:20:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467293 (owner: 10Marostegui) [09:21:53] (03Merged) 10jenkins-bot: db-eqiad.php: Restore weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467293 (owner: 10Marostegui) [09:23:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Restore original weight for db1104 (duration: 00m 49s) [09:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:47] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4967 MB (3% inode=86%) [09:27:19] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 and db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467295 [09:28:35] * elukey checks notebook1003.. [09:28:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 317 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [09:30:52] (03CR) 10jenkins-bot: db-eqiad.php: Restore weight for db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467293 (owner: 10Marostegui) [09:32:30] (03CR) 10Vgutierrez: [C: 04-1] Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [09:32:57] (03CR) 10Muehlenhoff: mediawiki::web::prod_sites: convert vote.w.o (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462492 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:33:57] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 74.52 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:35:51] (03CR) 
10Volans: [C: 04-1] "Just a quick pass, I skip the SSL config leaving it to Valentin and I didn't check with the compiler. See inline for the details" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [09:36:59] (03PS4) 10Gehel: logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 [09:37:16] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1092 and db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467295 (owner: 10Marostegui) [09:37:27] (03CR) 10Gehel: "puppet compiler agrees this is a noop https://puppet-compiler.wmflabs.org/compiler1002/12905/" [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel) [09:38:12] (03CR) 10Gehel: [C: 032] logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel) [09:38:50] 10Operations, 10monitoring, 10Performance-Team (Radar): "Workers" data from prometheus for mw app servers alternates strangely - https://phabricator.wikimedia.org/T206939 (10fgiunchedi) The `prometheus.svc` endpoint in eqiad and codfw is backed by two independent Prometheus servers scraping the same targets.... 
[09:40:57] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert vote.w.o [puppet] - 10https://gerrit.wikimedia.org/r/462492 (https://phabricator.wikimedia.org/T196968) [09:41:35] !log max_binlog_size is set back to 1048576000 on ParseCache hosts (T206740) [09:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:39] T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 [09:42:14] (03PS2) 10Marostegui: db-eqiad.php: Depool db1092 and db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467295 [09:43:07] 10Operations, 10DBA, 10MediaWiki-Database, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) [09:43:21] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1092 and db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467295 (owner: 10Marostegui) [09:43:28] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1001/12910/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/462492 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:43:33] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 and db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467295 (owner: 10Marostegui) [09:43:42] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert vote.w.o [puppet] - 10https://gerrit.wikimedia.org/r/462492 (https://phabricator.wikimedia.org/T196968) [09:44:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 and db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467295 (owner: 10Marostegui) [09:45:33] 10Operations, 10Performance-Team, 10monitoring, 10Availability, 10Patch-For-Review: Perform a statsd and Graphite switch - https://phabricator.wikimedia.org/T206963 (10fgiunchedi) The most similar task is likely {T88997} and related. 
As far as graphite goes sending carbon line-oriented traffic is already... [09:45:43] !log Stop MySQL on db1116:3318 to reclone db1092 [09:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:49] 10Operations, 10cloud-services-team: User:Susannaanas has lost SSH private key in a computer crash - how to recover the developer account? - https://phabricator.wikimedia.org/T207005 (10MarcoAurelio) [09:45:59] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) I made a sub ticket for adding this to Wikibase itself for 3rd party users. [09:46:10] (03CR) 10Gehel: [C: 031] "puppet compiler agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/12909/" [puppet] - 10https://gerrit.wikimedia.org/r/466693 (owner: 10Gehel) [09:46:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 for recloning - T206743 (duration: 00m 49s) [09:46:17] (03PS2) 10Gehel: rsyslog: replace deprecated validate_numeric() with type contraints [puppet] - 10https://gerrit.wikimedia.org/r/466693 [09:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:19] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [09:47:21] (03CR) 10Gehel: [C: 032] rsyslog: replace deprecated validate_numeric() with type contraints [puppet] - 10https://gerrit.wikimedia.org/r/466693 (owner: 10Gehel) [09:47:31] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 and db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467295 (owner: 10Marostegui) [09:50:42] (03PS3) 10Muehlenhoff: mediawiki::web::prod_sites: convert test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462493 (https://phabricator.wikimedia.org/T196968) (owner: 
10Giuseppe Lavagetto) [09:51:27] PROBLEM - SSH cp5004.mgmt on cp5004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:52:58] (03PS2) 10Gehel: base::service_unit: add type constraints on parameters [puppet] - 10https://gerrit.wikimedia.org/r/466697 [09:53:52] XioNoX: ^^ cp5004 [09:54:12] (03CR) 10Gehel: "puppet compiler agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/12912/" [puppet] - 10https://gerrit.wikimedia.org/r/466697 (owner: 10Gehel) [09:54:17] (03CR) 10Gehel: [C: 032] base::service_unit: add type constraints on parameters [puppet] - 10https://gerrit.wikimedia.org/r/466697 (owner: 10Gehel) [09:55:02] (03PS2) 10Volans: cumin: enable known hosts backend in prod [puppet] - 10https://gerrit.wikimedia.org/r/466879 (https://phabricator.wikimedia.org/T206844) [09:55:54] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10LarsWirzenius) My wikitech account is LarsWirzenius, my preferred Unix username is liw. I've signed L3 on Oct 10. @greg could you approve the reques... [09:57:42] whut [09:58:06] (03CR) 10Gehel: "looks good, minor style issue, see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev) [09:58:46] vgutierrez: it's only mgmt (the server seem to be working fine) and only cp5004 [09:59:11] (03CR) 10Muehlenhoff: "This can also use legacy_rewrites=false, otherwise looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/462493 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [09:59:14] so no idea what's going on, I don't know much about ilo/idrac [10:00:03] is there a way to reset/bounce it from the host side? [10:00:38] hmmm [10:01:07] bast5001 is able to ping the mgmt interface and it's performing a tcp 3way handshake against cp5004.mgmt as expected [10:01:19] so it's only ssh? 
[10:01:31] XioNoX, vgutierrez: there is some doc in https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card [10:01:37] but requires SSH access to the console [10:02:03] oh irony [10:02:05] 3way handshake against 22/tcp I mean [10:02:12] yeah, that requires working SSH, otherwise smart hands need to chime in [10:04:53] cp5004.mgmt.eqsin.wmnet [10.132.129.104] 22 (ssh) open [10:04:56] that from einsteinium [10:08:25] hmmm ok.. the TCP is actually being established but for some reason is not even sending the service banner [10:09:22] Operations, monitoring, Discovery-Search (Current work), Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (Gehel) >>! In T206114#4665616, @fgiunchedi wrote: > Just in eqiad there's 839 matches, so likely we'll need some filtering/tuni... [10:09:34] Operations, DNS, Traffic: Create redirect to integration.wikimedia.org/zuul - https://phabricator.wikimedia.org/T207008 (MarcoAurelio) [10:10:27] I can actually connect to cp5004's mgmt [10:10:36] Operations, monitoring, Discovery-Search (Current work), Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (Volans) I would consider also making the threshold a percentage of the normal traffic.
[10:10:37] also getting a serial console [10:11:33] (CR) Volans: [C: 2] cumin: enable known hosts backend in prod [puppet] - https://gerrit.wikimedia.org/r/466879 (https://phabricator.wikimedia.org/T206844) (owner: Volans) [10:11:35] moritzm: and now we're getting a proper banner there [10:11:50] first login was ridiculously slow, maybe a minute until got the password prompt, but works in usual speed for a second attempt [10:12:23] if anything is missing in the IPMI wiki troubleshoot page let me know [10:12:47] RECOVERY - SSH cp5004.mgmt on cp5004.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) [10:12:50] I didn't do anything except logging in very slowly [10:13:01] Operations, monitoring, Discovery-Search (Current work), Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (Gehel) >>! In T206114#4665853, @Volans wrote: > I would consider also making the threshold a percentage of the normal traffic.... [10:13:06] that recovery is from me, I just re-triggered the service check manually [10:13:17] yeah I think we collided in that :) [10:13:45] good, that shows that mgmt that we really mean it! [10:13:46] so just another IPMI behaving weird :) [10:14:14] hardware ¯\_(ツ)_/¯ [10:15:37] <_joe_> moritzm: we should move to the cloud! [10:15:56] * vgutierrez starts the burning money machine [10:16:28] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (fgiunchedi) I ran @Krinkle script to audit grafana dashboards at https://gist.github.com/Krinkle/b5ceff5156c1f4cf3568e373cc135bad to gauge where...
[10:17:02] (PS5) Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [10:17:38] (CR) jerkins-bot: [V: -1] relforge: setup 2 instances to validate multi-instance configuration [puppet] - https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: Gehel) [10:18:11] (PS1) Elukey: profile::analytics::refinery::job::data_purge: move crons to timers [puppet] - https://gerrit.wikimedia.org/r/467299 (https://phabricator.wikimedia.org/T172532) [10:19:56] (CR) Volans: [C: -1] "Why is this needed? I see /var/run/icinga is there on icinga1001 and contains the Icinga pid as on jessie. The only differences are the pe" [puppet] - https://gerrit.wikimedia.org/r/467017 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn) [10:21:45] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (MoritzMuehlenhoff) >>! In T183454#4665862, @fgiunchedi wrote: > I ran @Krinkle script to audit grafana dashboards at https://gist.github.com/Krin...
[10:25:27] Operations: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (Volans) p:Triage>Normal [10:27:18] (CR) Vgutierrez: [C: 1] "nice work, it has already helped me :) take https://gerrit.wikimedia.org/r/#/c/operations/dns/+/451614/ as an example of an issue detected" [dns] - https://gerrit.wikimedia.org/r/444649 (https://phabricator.wikimedia.org/T182028) (owner: Volans) [10:28:58] (PS4) Volans: Add zone_validator script [dns] - https://gerrit.wikimedia.org/r/444649 (https://phabricator.wikimedia.org/T182028) [10:29:45] (CR) Volans: "> Patch Set 3: Code-Review+1" [dns] - https://gerrit.wikimedia.org/r/444649 (https://phabricator.wikimedia.org/T182028) (owner: Volans) [10:30:04] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1030). [10:31:12] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/467300 (https://phabricator.wikimedia.org/T128546) [10:32:12] (PS4) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert test.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/462493 (https://phabricator.wikimedia.org/T196968) [10:33:26] (CR) Volans: [C: 2] Add zone_validator script [dns] - https://gerrit.wikimedia.org/r/444649 (https://phabricator.wikimedia.org/T182028) (owner: Volans) [10:34:46] PROBLEM - Device not healthy -SMART- on heze is CRITICAL: cluster=misc device=megaraid,10 instance=heze:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=heze&var-datasource=codfw%2520prometheus%252Fops [10:35:18] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/467300 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:35:50] (PS1) Urbanecm: Throttle exception Czech Senior Citizens Write Wikipedia 
[mediawiki-config] - https://gerrit.wikimedia.org/r/467302 (https://phabricator.wikimedia.org/T206993) [10:36:01] (CR) Muehlenhoff: [C: -1] "https://grafana.wikimedia.org/dashboard/db/parsoid-servers-cpu-usage needs to be fixed first." [puppet] - https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff) [10:36:23] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/467300 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:42:28] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:467300|Bumping portals to master (T128546)]] (duration: 00m 49s) [10:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:32] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:43:18] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:467300|Bumping portals to master (T128546)]] (duration: 00m 49s) [10:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:44] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (fgiunchedi) >>! In T183454#4665862, @fgiunchedi wrote: > I ran @Krinkle script to audit grafana dashboards at https://gist.github.com/Krinkle/b5c...
[10:46:26] (03PS1) 10Ladsgroup: Enable reading from ct_tag_id in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467307 (https://phabricator.wikimedia.org/T194164) [10:47:44] !log installing tomcat7 security updates [10:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:17] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467300 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:50:41] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: DNS repo: add CI checks for obvious configuration errors - https://phabricator.wikimedia.org/T182028 (10Volans) The script has been merged into the DNS repo, it can be run locally with `python3 utils/zone_validator.py -h`, it has no externa... [10:52:06] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [10:53:07] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [10:54:53] <_joe_> mobrovac: is that mcs ^^ ? [10:55:33] yeah, these time out sometimes, and we are not sure why _joe_ [10:55:36] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 196.8 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [10:55:42] there's a ticket about it, so we'll investigate soon [10:57:58] !log installing ghostscript security updates for jessie [10:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:15] <_joe_> something is going wrong [10:59:23] <_joe_> see the log ingestion rate alert [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1100). [11:00:04] Urbanecm and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:05] they are DbReplication logs in logstash [11:00:09] o/ [11:00:09] <_joe_> yes [11:00:13] it looks like [11:00:14] <_joe_> marostegui: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1092 is not replicating? [11:00:20] let me guess, snapshost [11:00:20] o/ [11:00:21] Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1092 is not replicating? [11:00:22] here! [11:00:26] _joe_: snapshoting is broken [11:00:30] _joe_: it is depooled and yeah, not replicating [11:00:33] taht host is depooled [11:00:44] <_joe_> so why is mediawiki alerting about it? [11:00:59] because php only reloads the config per web request [11:01:07] <_joe_> doesn't look that depooled to me [11:01:15] <_joe_> the alert is ongiung since 15 minutes [11:01:20] <_joe_> *ongoing [11:01:23] its from /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpJson.php [11:01:37] _joe_: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php&1 [11:01:37] _joe_: https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php it is [11:01:52] please send a bug to snapshot generation, we did many times [11:02:02] <_joe_> yes, I know. The errors are either from snapshots, or some from the appservers [11:02:13] <_joe_> but those stopped quite some time ago, right [11:02:19] <_joe_> it's just snapshots right now [11:02:44] https://phabricator.wikimedia.org/T138208 [11:02:55] <_joe_> yeah snapshots make sense [11:02:58] "I would like the ability to tell LB to drop the current connection AND config and re-read. This would be very handy in general." 
[11:03:02] ^ [11:03:06] complain there [11:03:24] mediawiki doesn't deploy the only way we have to depool a server [11:03:56] <_joe_> I was not complaining, I was just checking production errors [11:04:17] I want you to complain ! [11:04:19] :-D [11:04:35] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462493 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:04:37] so I am not the only one [11:04:38] <_joe_> I know :P [11:04:45] Amir1: go ahead with your patch while I review Urbanecm's patches [11:04:51] sure [11:05:11] it is also on my list of top 5 architecture mistakes [11:05:35] <_joe_> jynus: well to be fair [11:05:35] I will comment on that task :-) [11:05:40] _joe_: we were in a meeting, sorry [11:06:07] <_joe_> for a web request the assumption is ok (that you can load the db config at startup); for long-running jobs, it's not [11:06:10] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467307 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:06:16] _joe_: hence the arch mistake [11:06:35] I will ask hoo o lok at it again (or poke someone) [11:06:38] <_joe_> well it's the duty of those running long jobs to keep that into account [11:06:39] *to look [11:06:40] which would be solved by either having a server context (propper server) [11:06:44] sounds like the mw loadbalancer just needs something to allow it to reload config every few mins [11:06:47] or to imitate that [11:06:48] or every min [11:06:56] like you do with etcd [11:07:01] apergos: Talking about https://phabricator.wikimedia.org/T147169? 
[11:07:03] or poolcounter, memcache [11:07:05] etc, [11:07:10] <_joe_> jynus: no, what we do with etcd won't help you here [11:07:12] (03Merged) 10jenkins-bot: Enable reading from ct_tag_id in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467307 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:07:16] * hoo was literally looking at this right now [11:07:21] <_joe_> that's not what's the problem :P [11:07:34] _joe_: I was just saying that is one of the solutions to a similar problem [11:07:39] <_joe_> etcd data is anyways read just once per script run [11:07:39] not to this one [11:07:58] <_joe_> unless you force it to happen :P [11:08:07] config in this case should be pushed, not pulled every request [11:08:15] hoo, well sort of [11:08:27] for example, if we had a non-mediawiki controlled load balancer [11:08:33] for what is worth, I just finished recloning db1092 so I am starting mysql there [11:08:36] it's more about being able to update the config when a db server is depooled [11:08:40] I could just update the proxy config, and be applied automatically to all connections [11:09:02] and maybe kill those not updated after X seconds [11:09:32] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467307|Enable reading from ct_tag_id in s7 (T194164)]] (duration: 00m 49s) [11:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:35] T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164 [11:09:48] jynus: that sounds lovely IMO :P [11:10:16] addshore: I cannot do that, mediawiki controls the load balancer [11:10:37] I was on mwmaint1002 and did a status of hhvm to check other things [11:10:40] <_joe_> jynus: just because we didn't configure it not to [11:10:42] It's deployed zeljkof SWAT is yours [11:10:42] and I found: [11:10:56] I might jump back to revert if graphs go crazy (unlikely) [11:10:57] Notice: Use of undefined 
constant SCHEMA_COMPAT_WRITE_BOTH - assumed 'SCHEMA_COMPAT_WRITE_BOTH' and same for SCHEMA_COMPAT_READ_NEW [11:11:02] Amir1: thanks, taking over swat [11:11:09] doesn't seem normal, but I have zero context on those [11:11:19] (03PS2) 10Urbanecm: Throttle exception Czech Senior Citizens Write Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467302 (https://phabricator.wikimedia.org/T206993) [11:11:19] does it ring a bell? [11:11:19] _joe_: because application ha is now highly tied to the business logic [11:11:22] <_joe_> zeljkof: know anything abut that? [11:11:47] I know what those are but I don't know why they should be undefined (those constants) [11:11:50] InitialiseSettings.php on line 20794 and 20795 [11:11:52] those are for mcr [11:11:53] volans: MCR [11:11:54] _joe_: about what? [11:12:02] <_joe_> 11:10 < volans> Notice: Use of undefined constant SCHEMA_COMPAT_WRITE_BOTH - assumed 'SCHEMA_COMPAT_WRITE_BOTH' and same for SCHEMA_COMPAT_READ_NEW │········ [11:12:04] zeljkof, I have uploaded PS2 to 467302, FYI [11:12:15] (03CR) 10jerkins-bot: [V: 04-1] Throttle exception Czech Senior Citizens Write Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467302 (https://phabricator.wikimedia.org/T206993) (owner: 10Urbanecm) [11:12:18] _joe_: sorry, I have no clue [11:12:23] <_joe_> I was wondering if it was due to swat or not [11:12:25] <_joe_> I guess not [11:12:40] jynus: ack, but it's normal they are undefined? 
[11:12:55] (03PS3) 10Urbanecm: Throttle exception Czech Senior Citizens Write Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467302 (https://phabricator.wikimedia.org/T206993) [11:12:56] _joe_: Amir1 has just deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/467307 [11:13:03] I didn't deploy anything yet [11:13:16] volans: that is the only thing I know, "sounds like MCR", sorry [11:13:28] k, thanks [11:13:29] they should be defined [11:13:42] It seems like MCR [11:13:46] it is mcr [11:13:55] The constant I'm using is MIGRATION_NEW [11:13:56] yes in the wgMultiContentRevisionSchemaMigrationStage block [11:14:08] <_joe_> apergos: they're not [11:14:10] apparently on mwmaint1002 are not defined [11:14:16] not sure elsewhere [11:14:28] Urbanecm: please stand by, you're next [11:14:29] <_joe_> apergos: they're defined only in tests/Defines.php AFAICS [11:14:31] kk [11:14:43] * apergos goes to grep mediawiki core again [11:14:55] no the config files, forgot [11:15:16] ./wmf-config/InitialiseSettings.php: 'default' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD, [11:15:21] let's see if that's still true [11:15:23] *reads up* [11:16:06] still true. let's see where they are actually set though [11:16:26] Urbanecm: there is trailing whitespace in the commit message of 465283 [11:16:32] <_joe_> volans: the constants are defined in includes/Defines.php in mediawiki core [11:16:44] <_joe_> so it's pretty strange the error you see [11:16:56] <_joe_> volans: mwmaint1002, hhvm service? [11:16:58] <_joe_> that's NOC [11:16:59] (03PS10) 10Urbanecm: Add two throttle rules and remove outdated rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [11:17:02] <_joe_> aha [11:17:03] _joe_: yes [11:17:07] <_joe_> volans: ok disregard [11:17:10] zeljkof, better? 
[11:17:14] (03CR) 10Vgutierrez: [C: 04-1] Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [11:17:15] probably includes/Defines [11:17:17] hello. IIRC those constants are for the actor change ( https://phabricator.wikimedia.org/T188327 ). And it used to be feature flagged [11:17:22] <_joe_> apergos: it's there, yes [11:17:22] Urbanecm: also please update the first line of the commit message to mention the events [11:17:28] noc.wm.o? :P [11:17:40] (sorry, I don't see this window when I'm in another one grepping around) [11:17:41] <_joe_> apergos: the issue is we don't include those in noc.w.o [11:17:43] <_joe_> :) [11:17:46] ah [11:17:57] I never look at noc, just at my clone of the repo [11:17:58] <_joe_> so, yeah, red herring [11:18:07] <_joe_> no I mean [11:18:12] <_joe_> the server spitting those alerts [11:18:18] <_joe_> is the one serving noc.wikimedia.org [11:18:29] zeljkof, are you sure it is possible in 70 chars? [11:18:53] Urbanecm: nevermind, you have a point [11:19:33] (03CR) 10jenkins-bot: Enable reading from ct_tag_id in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467307 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [11:19:38] oic [11:19:46] that's... 
unfortunate :-) [11:20:06] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [11:20:47] hmm, it seems I need to do something again [11:21:08] (03Merged) 10jenkins-bot: Add two throttle rules and remove outdated rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [11:21:14] thanks everyone and sorry for the trouble, but looked strande [11:21:16] *strage [11:23:07] zeljkof: the change is not deployed there [11:23:17] (03PS2) 10Elukey: profile::analytics::refinery::job::data_purge: move crons to timers [puppet] - 10https://gerrit.wikimedia.org/r/467299 (https://phabricator.wikimedia.org/T172532) [11:23:19] Amir1: where? [11:23:26] I'm double checking, does settings accept db groups? like s7 [11:23:30] in prod [11:23:47] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/467307/1/wmf-config/InitialiseSettings.php [11:23:53] Amir1: wait, I don't knwo what your're talking about [11:23:55] Amir1, it does accept dblists [11:24:15] Amir1: why isn't it deployed? [11:24:31] Checking out why [11:24:49] Amir1, does it throw some error message? [11:25:14] no no, it's just it deson't have the value when I echo it in eval.php [11:26:50] !log zfilipin@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [11:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:49] Amir1, can it be just cache? [11:28:02] Amir1: ok, so scap problems [11:28:06] ^ [11:28:23] lt might be [11:28:41] zeljkof: Do you know if s7 is valid to use there? 
[11:28:52] Amir1: I don't know [11:29:04] I checked and it's section group is not used at all anywhere esle [11:29:24] !log Started rebuildItemsPerSite on mwmaint1002 (T44325). Can be killed at any time, if necessary. [11:29:24] `BadMethodCallException from line 72 of /srv/mediawiki/php-1.32.0-wmf.24/includes/api/ApiFeedContributions.php` [11:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:29] T44325: Prevent creation of items having the same sitelinks (duplicates) - https://phabricator.wikimedia.org/T44325 [11:29:39] Amir1: is this something you could have caused? ^ [11:29:49] it's likely [11:29:53] where is it? [11:30:14] Urbanecm: sorry, scap problems, I'll revert the patch I already merged [11:30:18] Amir1: https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 [11:31:11] zeljkof: no it's not me. The wikis are not in deployed wikis [11:31:15] (03PS1) 10Zfilipin: Revert "Add two throttle rules and remove outdated rule" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467311 [11:31:23] ocwiki for example [11:31:27] zeljkof, scap problems? with a throttle rule? [11:31:41] Urbanecm: something is wrong [11:31:42] that's something I'm seeing for the first time... [11:31:57] `11:26:49 sync-file failed: scap failed: average error rate on 4/11 canaries increased by 10x ` [11:32:21] seems so [11:32:23] not because of the rule, but I can't deploy it, I have to revert the commit [11:32:32] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467311 (owner: 10Zfilipin) [11:33:30] I see. 
Will you try to figure out the cause or do we have to not deploy all the patches, [11:33:37] (03Merged) 10jenkins-bot: Revert "Add two throttle rules and remove outdated rule" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467311 (owner: 10Zfilipin) [11:33:54] Urbanecm: I'll abort swat for today, I don't know what the problem is [11:34:02] (03PS1) 10Volans: mediawiki: kill also HHVM on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/467313 (https://phabricator.wikimedia.org/T207014) [11:34:25] Ok. If you create a phab ticket, please add me as subscriber. Thanks! [11:34:41] (03CR) 10jenkins-bot: Add two throttle rules and remove outdated rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [11:36:04] Urbanecm: please submit 465283 as three separate patches, one cleanup, and patch per event [11:36:12] will do [11:36:30] Amir1: were you able to deploy your commit at the start of SWAT [11:37:23] zeljkof: yup [11:37:25] (03CR) 10jenkins-bot: Revert "Add two throttle rules and remove outdated rule" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467311 (owner: 10Zfilipin) [11:37:25] (03PS1) 10Mathew.onipe: wdqs: removed rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/467314 (https://phabricator.wikimedia.org/T204240) [11:37:27] (03PS1) 10Ladsgroup: Enable reading from new backend of change_tag in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467315 (https://phabricator.wikimedia.org/T194164) [11:37:28] even it went on SAL [11:37:45] but I don't think it had any effect [11:37:47] (03PS1) 10Urbanecm: Remove expired throttle exceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467316 [11:38:05] (03PS1) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) [11:38:14] Amir1, Urbanecm: ok, I'll abort swat, and create 
a phab ticket [11:38:52] (03CR) 10Jcrespo: "The regex is untested." [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [11:38:59] !log EU SWAT finished [11:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:55] (03CR) 10Jcrespo: "CC _joe_ in case he wants to fix this behavior in the class and not just for dbs." [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [11:41:25] (03Abandoned) 10Urbanecm: Throttle exception Czech Senior Citizens Write Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467302 (https://phabricator.wikimedia.org/T206993) (owner: 10Urbanecm) [11:41:46] (03PS1) 10Mathew.onipe: tilerator: removed rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/467319 (https://phabricator.wikimedia.org/T204240) [11:43:53] (03PS1) 10Fdans: Add change_tag to the list of tables to sqoop in cron [puppet] - 10https://gerrit.wikimedia.org/r/467320 (https://phabricator.wikimedia.org/T205940) [11:44:04] (03PS1) 10Urbanecm: Add throttle rule for "Night of the Digital Language" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467321 [11:44:40] (03CR) 10Fdans: [C: 04-1] "Minusoneing until https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/465416/ is merged and deployed." [puppet] - 10https://gerrit.wikimedia.org/r/467320 (https://phabricator.wikimedia.org/T205940) (owner: 10Fdans) [11:46:59] (03PS2) 10Urbanecm: Remove expired throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467316 (https://phabricator.wikimedia.org/T207015) [11:48:53] zeljkof: https://logstash.wikimedia.org/goto/66077fe458edbc973c40f67e262ab9ac It has been happening for a while now [11:51:27] Amir1: I'll report the problem in phab and cc you, Urbanecm and liw [11:52:01] Thanks. 
I looked at the code and I think I know how to reproduce it and what's the issue [11:52:08] I'll comment in the code [11:57:29] !log start of mwscript deleteLocalPasswords.php --delete --batch-size 200 on all wikis [11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:36] let's get the party started [12:00:58] (03PS1) 10Mathew.onipe: elasticsearch: modify thresholds for icinga check shard size plugin [puppet] - 10https://gerrit.wikimedia.org/r/467322 (https://phabricator.wikimedia.org/T206187) [12:01:39] (03PS1) 10Urbanecm: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) [12:02:28] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [12:04:23] Amir1, Urbanecm, liw: T207018 [12:04:24] T207018: RuntimeError: scap failed: average error rate on 4/11 canaries increased by 10x - https://phabricator.wikimedia.org/T207018 [12:04:31] I'll add more data [12:04:37] to the task [12:04:40] thx [12:04:47] also, feel free to update the task with more data [12:04:57] zeljkof, ack [12:07:31] (03PS2) 10Urbanecm: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) [12:19:07] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10fgiunchedi) Not really "logstash" but using #wikimedia-logstash for logging-related tasks [12:25:22] (03CR) 10Filippo Giunchedi: [C: 032] debian: use standard rules for Prometheus packages [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465351 (owner: 10Filippo Giunchedi) [12:25:32] (03CR) 10Filippo Giunchedi: [C: 032] debian: update changelog [debs/prometheus-statsd-exporter] - 
10https://gerrit.wikimedia.org/r/465352 (owner: 10Filippo Giunchedi) [12:25:46] (03CR) 10Filippo Giunchedi: [C: 032] debian: add patch for inline udp usage [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465414 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [12:26:08] (03CR) 10Filippo Giunchedi: [C: 032] debian: ship systemd service [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [12:28:32] (03CR) 10Filippo Giunchedi: [C: 032] Merge tag 'upstream/0.7.0' [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465349 (owner: 10Filippo Giunchedi) [12:28:35] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Merge tag 'upstream/0.7.0' [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465349 (owner: 10Filippo Giunchedi) [12:30:14] (03PS3) 10Urbanecm: Initial configuration for shnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467323 (https://phabricator.wikimedia.org/T206777) [12:32:32] (03PS5) 10Urbanecm: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) [12:33:54] (03PS6) 10Urbanecm: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) [12:34:11] (03PS6) 10Urbanecm: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) [12:35:10] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [12:35:12] (03CR) 10Elukey: [C: 031] Clean up removed rsyncd configs [puppet] - 10https://gerrit.wikimedia.org/r/465583 (https://phabricator.wikimedia.org/T205618) (owner: 10Muehlenhoff) [12:35:22] (03CR) 10Gehel: [C: 032] elasticsearch: modify 
thresholds for icinga check shard size plugin [puppet] - 10https://gerrit.wikimedia.org/r/467322 (https://phabricator.wikimedia.org/T206187) (owner: 10Mathew.onipe) [12:41:17] !log upgrade prometheus-memcached-exporter on swift and thumbor [12:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:01] !log complete rolling restart of eventbus on kafka[12]00[1-3] for python security upgrades (only codfw was done) [12:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:06] PROBLEM - configured eth on notebook1003 is CRITICAL: Return code of 255 is out of bounds [12:44:09] Cc: mobrovac --^ [12:44:36] PROBLEM - MD RAID on notebook1003 is CRITICAL: Return code of 255 is out of bounds [12:44:37] !log reseting kafka offsets on wdqs public cluster [12:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:46] PROBLEM - puppet last run on notebook1003 is CRITICAL: Return code of 255 is out of bounds [12:44:47] PROBLEM - dhclient process on notebook1003 is CRITICAL: Return code of 255 is out of bounds [12:44:57] PROBLEM - DPKG on notebook1003 is CRITICAL: Return code of 255 is out of bounds [12:44:57] PROBLEM - Check systemd state on notebook1003 is CRITICAL: Return code of 255 is out of bounds [12:45:08] (03PS2) 10Gehel: My tests show that Kafka poller behaves much better with -b 700 [puppet] - 10https://gerrit.wikimedia.org/r/467002 (owner: 10Smalyshev) [12:45:26] kk thnx elukey [12:45:40] !log rebooting db2096 [12:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:23] (03PS3) 10Gehel: My tests show that Kafka poller behaves much better with -b 700 [puppet] - 10https://gerrit.wikimedia.org/r/467002 (owner: 10Smalyshev) [12:47:32] (03PS4) 10Gehel: wdqs: increase updater batch to 700 for kafka [puppet] - 10https://gerrit.wikimedia.org/r/467002 (owner: 10Smalyshev) [12:49:17] (03CR) 10Gehel: [C: 032] wdqs: increase updater batch to 700 
for kafka [puppet] - 10https://gerrit.wikimedia.org/r/467002 (owner: 10Smalyshev) [12:53:07] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [12:53:26] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [12:54:16] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [12:54:26] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [12:55:37] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 85.73 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [12:58:05] (03PS1) 10Gehel: wdqs: re-enable kafka poller on wdqs public cluster [puppet] - 10https://gerrit.wikimedia.org/r/467331 [12:58:35] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [12:59:29] sigh [13:00:20] (03CR) 10Giuseppe Lavagetto: mcrouter: allow defining a non-default number of backend connectors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) (owner: 10Giuseppe Lavagetto) [13:01:36] 10Operations: v6 ND failure on puppetmaster1001/asw2-b-eqiad - https://phabricator.wikimedia.org/T200838 (10ayounsi) 05Open>03Resolved Addressed in T201039#4650390 [13:01:41] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10netops: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585 (10ayounsi) [13:02:36] !log upload prometheus-statsd-exporter 0.7.0 - T205870 [13:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:02:40] T205870: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 [13:02:41] 10Operations, 10DNS, 10Traffic, 10Wiki-Setup (Rename): Redirect dk.wiktionary and dk.wikibooks to da.wiktionary and da.wikibooks respectively. - https://phabricator.wikimedia.org/T17357 (10MarcoAurelio) [13:03:12] (03PS2) 10Giuseppe Lavagetto: mcrouter: allow defining a non-default number of backend connectors [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) [13:03:50] (03PS1) 10Marostegui: db-eqiad.php: Repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467332 [13:05:46] (03PS1) 10Jgreen: rename saiph to frpig2001, add cnames temporarily [dns] - 10https://gerrit.wikimedia.org/r/467333 (https://phabricator.wikimedia.org/T203521) [13:05:57] PROBLEM - IPMI Sensor Status on notebook1003 is CRITICAL: Return code of 255 is out of bounds [13:07:11] (03CR) 10Jgreen: [C: 032] rename saiph to frpig2001, add cnames temporarily [dns] - 10https://gerrit.wikimedia.org/r/467333 (https://phabricator.wikimedia.org/T203521) (owner: 10Jgreen) [13:07:22] (03Abandoned) 10Marostegui: db-eqiad.php: Repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467332 (owner: 10Marostegui) [13:07:27] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: Return code of 255 is out of bounds [13:07:32] (03PS2) 10Jgreen: rename saiph to frpig2001, add cnames temporarily [dns] - 10https://gerrit.wikimedia.org/r/467333 (https://phabricator.wikimedia.org/T203521) [13:07:51] (03CR) 10Jgreen: [V: 032 C: 032] rename saiph to frpig2001, add cnames temporarily [dns] - 10https://gerrit.wikimedia.org/r/467333 (https://phabricator.wikimedia.org/T203521) (owner: 10Jgreen) [13:09:46] !;og auithdns-update to deploy saiph->frpig2001 rename [13:09:59] (03PS3) 10Giuseppe Lavagetto: mcrouter: allow defining a non-default number of backend connectors [puppet] - 
10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) [13:10:39] Jeff_Green: that log message didn't go through, semicolon typo! [13:10:51] (03PS6) 10Banyek: mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) [13:10:51] oh ha, thanks! [13:10:58] !log auithdns-update to deploy saiph->frpig2001 rename [13:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:36] (03CR) 10jerkins-bot: [V: 04-1] mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [13:11:41] (03PS1) 10Gehel: motd::script: add type constraints and get rid of compiler warnings [puppet] - 10https://gerrit.wikimedia.org/r/467335 [13:12:05] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter: allow defining a non-default number of backend connectors [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) (owner: 10Giuseppe Lavagetto) [13:13:29] (03PS7) 10Banyek: mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) [13:13:50] (03PS4) 10Filippo Giunchedi: New class: prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [13:13:52] (03PS3) 10Filippo Giunchedi: thumbor: add prometheus-statsd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [13:14:31] (03CR) 10jerkins-bot: [V: 04-1] mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [13:16:00] (03PS8) 10Banyek: mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) [13:16:48] !log stopping db1092 and 
db1087 in sync T206743 [13:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:51] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 [13:17:06] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [13:17:17] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [13:17:26] RECOVERY - DPKG on notebook1003 is OK: All packages OK [13:17:27] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [13:17:37] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [13:20:36] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:21:34] (03CR) 10Marostegui: [C: 031] mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [13:22:25] (03CR) 10Banyek: [C: 032] mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [13:22:46] (03CR) 10Banyek: [V: 032 C: 032] mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [13:22:50] (03PS58) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [13:22:52] (03CR) 10Alex Monk: Central certificates service (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:23:08] (03CR) 10Filippo Giunchedi: wmcs: add prometheus-memcached-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431595 
(https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [13:23:44] (03CR) 10jerkins-bot: [V: 04-1] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:23:47] (03PS1) 10Gehel: monitoring::check_prometheus: fix compiler warnings, adding type constraints [puppet] - 10https://gerrit.wikimedia.org/r/467337 [13:24:25] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) All the pooled replicas have now compressed tables, can you confirm from your end if this... [13:25:16] (03PS59) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [13:25:24] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) [13:27:05] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:27:28] (03PS1) 10Gehel: base::expose_puppet_certs: fix compiler warnings, add type constraints [puppet] - 10https://gerrit.wikimedia.org/r/467338 [13:28:57] PROBLEM - MariaDB Slave Lag: s8 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 498.43 seconds [13:30:50] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T206593: adding db2096 to hosts (and repooling db2069) (duration: 00m 49s) [13:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:53] T206593: Productionize db2096 on x1 - https://phabricator.wikimedia.org/T206593 [13:32:00] !log banyek@deploy1001 Synchronized wmf-config/db-codfw.php: T206593: adding db2096 to hosts (and repooling 
db2069) (duration: 00m 49s) [13:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:21] (03PS1) 10Anomie: Enable MCR read-new on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467340 (https://phabricator.wikimedia.org/T198308) [13:32:40] (03CR) 10Muehlenhoff: wmcs: add prometheus-memcached-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [13:33:06] (03CR) 10Anomie: [C: 032] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467340 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie) [13:34:10] (03Merged) 10jenkins-bot: Enable MCR read-new on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467340 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie) [13:35:43] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting MCR migration stage to write-both/read-new on Commons (T198308) (duration: 00m 49s) [13:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] T198308: Enable MCR migration stage "write both, read new" on live systems - https://phabricator.wikimedia.org/T198308 [13:36:08] RECOVERY - IPMI Sensor Status on notebook1003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [13:36:38] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Later), and 2 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10mobrovac) [13:37:36] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Mon 2018-10-15 13:37:30 UTC. 
[13:40:07] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-section [13:40:07] t retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 504 (expecting: 200) [13:41:26] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [13:42:45] (03CR) 10jenkins-bot: mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek) [13:42:46] (03CR) 10jenkins-bot: Enable MCR read-new on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467340 (https://phabricator.wikimedia.org/T198308) (owner: 10Anomie) [13:44:55] 10Operations, 10Recommendation-API, 10Research, 10Core Platform Team Backlog (Later), and 2 others: Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10bmansurov) [13:47:22] PROBLEM - MariaDB Slave Lag: s8 on db1092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1591.91 seconds [13:47:45] <_joe_> isn't this depooled? [13:47:46] expected? ^^^ [13:48:23] (03CR) 10Volans: [C: 031] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/467335 (owner: 10Gehel) [13:48:54] look at db1092 [13:49:00] on icinga [13:49:03] that is not normal [13:49:29] the silent alarms have disappeared [13:49:39] that must be a bug [13:49:55] <_joe_> jynus: next time just tell me "that has notifications disabled, it should not page" instead of making me navigate icinga's UI [13:50:23] <_joe_> jynus: and I would've told you that the host has notifications disabled, the service does not, apparently [13:50:36] _joe_: look at the rest of the services [13:50:37] <_joe_> so either it was lost or not set [13:50:49] those were lost [13:51:06] I am 100% sure they were set a few hours ago [13:51:06] (03CR) 10Gehel: [C: 04-1] "A few comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [13:51:49] (03CR) 10Mathew.onipe: [C: 031] base::expose_puppet_certs: fix compiler warnings, add type constraints [puppet] - 10https://gerrit.wikimedia.org/r/467338 (owner: 10Gehel) [13:51:56] <_joe_> It's pretty easy to get that wrong (disable notification for host or for host and all services), that's why I was asking [13:52:06] <_joe_> if it was set, it should be in icinga events log [13:52:06] no, that happened before [13:52:38] (03PS1) 10Jgreen: rename saiph to frpig2001 for icinga and smokeping [puppet] - 10https://gerrit.wikimedia.org/r/467341 (https://phabricator.wikimedia.org/T203521) [13:52:39] look at Modified Attributes: notifications_enabled [13:52:45] it is disabled on puppet [13:52:52] there is no margin for human error [13:53:14] they have been disabled for weeks [13:53:32] (03CR) 10Jgreen: [C: 032] rename saiph to frpig2001 for icinga and smokeping [puppet] - 10https://gerrit.wikimedia.org/r/467341 (https://phabricator.wikimedia.org/T203521) (owner: 10Jgreen) [13:54:01] cat hieradata/hosts/db1092.yaml [13:54:06] profile::base::notifications_enabled: '0' [13:54:21] jynus: modified attribute means 
it was modified on the UI compared to the on disk config [13:54:33] exactly, so it must be a bug [13:54:39] because no one changed those [13:55:09] having the first assumption that it's a bug on Icinga doesn't seem the most appropriate way to debug this IMHO [13:56:02] [1539611202] HOST DOWNTIME ALERT: db1092;STOPPED; Host has exited from a period of scheduled downtime [13:56:07] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), and 2 others: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105 (10mobrovac) It takes 75 secs for Citoid in producti... [13:56:57] (03PS2) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [13:58:01] (03PS3) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [13:58:14] <_joe_> volans: scheduled downtime has nothing to do with enabled/disabled notifications though [13:58:15] (03CR) 10Vgutierrez: [C: 04-1] Central certificates service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:59:09] (03PS60) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:01:01] (03PS4) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [14:01:03] (03PS1) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) [14:02:02] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10mobrovac) [14:04:22] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10mobrovac) [14:04:36] (03CR) 10Mathew.onipe: "Puppet compiler says all good:" [puppet] - 10https://gerrit.wikimedia.org/r/467337 (owner: 10Gehel) [14:04:46] (03CR) 10Volans: [C: 031] "LGTM, double check with the compiler to be on the safe side ;)" [puppet] - 10https://gerrit.wikimedia.org/r/467338 (owner: 10Gehel) [14:05:29] (03CR) 10Mathew.onipe: [C: 031] monitoring::check_prometheus: fix compiler warnings, adding type constraints [puppet] - 10https://gerrit.wikimedia.org/r/467337 (owner: 10Gehel) [14:05:41] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10Joe) My suggestion would be to create a `nodeX... 
[14:07:27] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::data_purge: move crons to timers [puppet] - 10https://gerrit.wikimedia.org/r/467299 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:08:14] (03PS3) 10Elukey: profile::analytics::refinery::job::data_purge: move crons to timers [puppet] - 10https://gerrit.wikimedia.org/r/467299 (https://phabricator.wikimedia.org/T172532) [14:08:24] (03PS1) 10Addshore: Revert "logging: Disable 'Wikibase.NewItemIdFormatter' channel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467345 [14:08:47] (03CR) 10Addshore: [C: 04-2] "Can revert after wednesday train slot once the new code lands on the servers..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467345 (owner: 10Addshore) [14:09:28] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::data_purge: move crons to timers [puppet] - 10https://gerrit.wikimedia.org/r/467299 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:13:38] 10Operations, 10netops: relabel switch interfaces formerly saiph.frack.codfw.wmnet to frpig2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T207035 (10Jgreen) [14:13:56] (03PS1) 10Mathew.onipe: cumin: added wdqs-autodeploy alias [puppet] - 10https://gerrit.wikimedia.org/r/467346 [14:14:50] 10Operations, 10ops-codfw: relabel server saiph.frack.codfw.wmnet to frpig2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T207036 (10Jgreen) [14:17:53] (03CR) 10Ottomata: [C: 031] Raise the Yarn Hadoop Resource Manager heap min/max to 4G [puppet] - 10https://gerrit.wikimedia.org/r/467286 (https://phabricator.wikimedia.org/T206943) (owner: 10Elukey) [14:20:50] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Backfill librenms data in graphite with historical RRDs - https://phabricator.wikimedia.org/T173698 (10fgiunchedi) 05Open>03declined We're one year of librenms data in Graphite already, I'm declining this since we'll eventually reach librenms ret... 
[14:20:53] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167 (10fgiunchedi) [14:20:54] (03CR) 10Gehel: "Puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/12918/" [puppet] - 10https://gerrit.wikimedia.org/r/467337 (owner: 10Gehel) [14:22:26] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Audit groups of metrics in Graphite that allocate a lot of disk space - https://phabricator.wikimedia.org/T1075 (10fgiunchedi) [14:22:31] 10Operations, 10Analytics, 10monitoring, 10Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644 (10fgiunchedi) 05Open>03Resolved We're doing good space wise now: ``` # du -hcs /var/lib/carbon/whisper/eventstreams/ 4.8G /var/lib/carbon/whisper/eventstreams/ ``` [14:24:08] (03PS61) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:27:14] 10Operations, 10netops, 10Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (10faidon) @ayounsi, what's the current status of this task? Last update is from over a year ago, but I think some of our latest woes wit... [14:27:33] 10Operations, 10Wikimedia-Logstash: Add monitoring for detecting when logstash services are down - https://phabricator.wikimedia.org/T141783 (10fgiunchedi) 05Open>03Invalid I don't think we've seen reoccurrence of this, also logstash now has monitoring for udp packet loss which I'm assuming would also show... 
[14:28:20] (03CR) 10Tarrow: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [14:28:32] 10Operations, 10Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (10fgiunchedi) [14:29:01] (03Abandoned) 10Giuseppe Lavagetto: profile::mediawiki::php: add apcu-bc for backwards compatibility [puppet] - 10https://gerrit.wikimedia.org/r/464111 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [14:29:45] !log rebooting backup2001 for some tests [14:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:18] (03PS62) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:33:36] 10Operations, 10Wikimedia-Logstash: logstash group1 dashboard incorrectly shows testwikidatawiki - https://phabricator.wikimedia.org/T184655 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Checked now, indeed now `testwikidatawiki` is in group0 not group1, resolving. 
[14:35:38] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.45 seconds [14:35:39] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.96 seconds [14:35:49] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.56 seconds [14:35:49] PROBLEM - MariaDB Slave Lag: s1 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.79 seconds [14:35:49] checking that [14:35:49] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.17 seconds [14:35:55] banyek: ^ [14:36:00] PROBLEM - MariaDB Slave Lag: s1 on db2072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.39 seconds [14:36:09] PROBLEM - MariaDB Slave Lag: s1 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.74 seconds [14:36:14] there is lag on all s1 codfw [14:36:29] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.63 seconds [14:36:33] marostegui: (unrelated to the s1 lag) FYI, we enabled MCR read-new on Commons this morning. In case there are suddenly weird queries/behavior on that wiki. [14:39:13] anomie: thanks [14:39:47] banyek: can you downtime all the s1 codfw hosts? [14:40:53] Amir1: is your script hitting enwiki now? [14:40:57] I see lots of deleting on codfw master [14:40:59] which is lagging [14:41:07] and they are related to your script [14:41:15] (I guess: UPDATE /* DeleteLocalPasswords::processUsers) [14:41:52] that's me [14:42:11] can you stop it? [14:42:19] so we can confirm that is the issue? [14:42:33] I can ease replication consistency options, but I want to confirm it is your script [14:43:25] I assume eqiad slaves have not lagged because they have ssds and db2048 doesn't [14:44:41] (03CR) 10Muehlenhoff: "One comment, but looks good to me otherwise." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:45:06] Amir1: is it stopped? [14:46:26] !log Ease consistency replication options on db2048 to mitigate lag [14:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:58] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 344.21 seconds [14:47:11] banyek: ping [14:47:37] Amir1: ping [14:48:00] marostegui: stopped now [14:48:12] thanks [14:48:32] let's see if the master recovers once it has catched up [14:48:36] *caught up [14:49:54] Amir1: do you have a timestamp on when it hit enwiki? [14:51:01] here [14:51:07] no unfortunately, I SAL'd the start of it though [14:51:18] the batch size was 200 [14:51:31] smallest possible [14:51:38] banyek: I already downtimed it [14:51:49] tx. had a phone call [14:52:10] Amir1: I guess I can check binlogs [14:52:34] I still see them on db2048, but it is catching up as I eased replication options [14:52:41] consistency options I mean [14:53:16] !log mforns@deploy1001 Started deploy [analytics/refinery@9b288c5]: deploy refinery together with source version 0.0.77 [14:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:48] RECOVERY - MariaDB Slave Lag: s1 on db2072 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:53:58] RECOVERY - MariaDB Slave Lag: s1 on db2092 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:54:18] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:54:22] Amir1: I don't see them anymore, so I am going to restore the defaults for replication consistency [14:54:28] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:54:29] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:54:38] RECOVERY - MariaDB 
Slave Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [14:54:39] RECOVERY - MariaDB Slave Lag: s1 on db2088 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [14:54:39] RECOVERY - MariaDB Slave Lag: s1 on db2085 is OK: OK slave_sql_lag Replication lag: 0.02 seconds [14:54:48] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [14:55:01] Amir1: You might need to do it some other way, this could have caused lag on eqiad (and thus an incident) if we didn't have SSDs there [14:55:10] maybe reduce the batch? insert a sleep? [14:56:52] It waits for replication :/ [14:56:53] <_joe_> wait for replica? [14:57:52] 10Operations, 10monitoring: Graphite1001 disk usage at 96% - https://phabricator.wikimedia.org/T207040 (10fgiunchedi) [14:57:52] Amir1: does it check all the replicas? [14:57:57] or just same DC replicas? [15:02:10] (03CR) 10Gehel: "puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/12919/" [puppet] - 10https://gerrit.wikimedia.org/r/467338 (owner: 10Gehel) [15:02:35] marostegui: The maintenance script uses the standard call for this, theoretically, it should handle all of those [15:04:32] marostegui: https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/includes/DeleteLocalPasswords.php$146 [15:05:50] * banyek away for an hour [15:05:53] Amir1: I trust you, I was just wondering how it works and why it didn't catch db2048 [15:06:34] I don't think it does the cross db replag checking [15:06:40] Because MW isn't aware of the other databases [15:07:06] Reedy: does it check the one on the master config file (ie db-eqiad.php?) 
[15:07:21] 10Operations, 10monitoring: Adapt Kafka dashboards to use metrics from prometheus-node-exporter - https://phabricator.wikimedia.org/T207041 (10MoritzMuehlenhoff) [15:07:21] unless something has changed recently, yeah [15:07:31] then that's our answer [15:07:37] We only load one of the db-*.php, so MW has no idea the other databases exist [15:07:41] uh, servers [15:07:47] all eqiad hosts have SSDs, whereas db2048 doesn't [15:08:25] 10Operations, 10monitoring: Adapt Kafka dashboards to use metrics from prometheus-node-exporter - https://phabricator.wikimedia.org/T207041 (10MoritzMuehlenhoff) [15:08:31] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) [15:09:05] (03CR) 10Alexandros Kosiaris: [C: 031] mediawiki: kill also HHVM on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/467313 (https://phabricator.wikimedia.org/T207014) (owner: 10Volans) [15:12:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Ottomata) Documentation updated at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest [15:13:35] !log mforns@deploy1001 Finished deploy [analytics/refinery@9b288c5]: deploy refinery together with source version 0.0.77 (duration: 20m 19s) [15:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:29] !log replacing optics asw2-b fpc2 -fpc8 [15:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:40] (03CR) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) (owner: 10Addshore) [15:18:56] (03PS2) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on 
wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) [15:22:10] (03Abandoned) 10Mathew.onipe: wdqs: removed rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/467314 (https://phabricator.wikimedia.org/T204240) (owner: 10Mathew.onipe) [15:24:47] (03CR) 10Gehel: [C: 04-1] cumin: added wdqs-autodeploy alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467346 (owner: 10Mathew.onipe) [15:26:11] (03CR) 10Volans: cumin: added wdqs-autodeploy alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467346 (owner: 10Mathew.onipe) [15:29:03] (03CR) 10Gehel: [C: 04-1] cumin: added wdqs-autodeploy alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467346 (owner: 10Mathew.onipe) [15:29:47] (03CR) 10Gehel: "puppet compiler agrees this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/467335 (owner: 10Gehel) [15:29:56] (03PS2) 10Gehel: motd::script: add type constraints and get rid of compiler warnings [puppet] - 10https://gerrit.wikimedia.org/r/467335 [15:30:37] (03CR) 10Gehel: [C: 032] motd::script: add type constraints and get rid of compiler warnings [puppet] - 10https://gerrit.wikimedia.org/r/467335 (owner: 10Gehel) [15:31:16] !log restarting slapd on seaborgium as a test for T205463 [15:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:19] T205463: sudo randomly prompts for passwords sometimes in labs instances - https://phabricator.wikimedia.org/T205463 [15:31:38] (03PS2) 10Urbanecm: Add throttle rule for "Night of the Digital Language" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467321 (https://phabricator.wikimedia.org/T206408) [15:32:35] (03PS2) 10Gehel: base::expose_puppet_certs: fix compiler warnings, add type constraints [puppet] - 10https://gerrit.wikimedia.org/r/467338 [15:33:28] (03CR) 10Gehel: [C: 032] base::expose_puppet_certs: fix compiler warnings, add type constraints [puppet] - 
10https://gerrit.wikimedia.org/r/467338 (owner: 10Gehel) [15:35:50] 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10akosiaris) All this seems pretty correct to me and does explain what we 've experienced pretty well [15:35:52] (03PS6) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [15:36:35] (03CR) 10jerkins-bot: [V: 04-1] relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [15:39:03] !log Stop MySQL and poweroff db1092 for BBU replacement - T205514 [15:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:06] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [15:40:46] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (10ayounsi) PEM is dead, RMA# R200206473 created. [15:44:47] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Andrew) We're unlikely to support bare metal in the near future. I would like to try a couple of things, t... [15:45:11] (03CR) 10EBernhardson: relforge: setup 2 instances to validate multi-instance configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [15:50:32] 10Operations, 10cloud-services-team: User:Susannaanas has lost SSH private key in a computer crash - how to recover the developer account? 
- https://phabricator.wikimedia.org/T207005 (10Susannaanas) 05Open>03Resolved a:03Susannaanas [15:51:20] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 2 others: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 (10Gehel) [15:51:51] (03PS2) 10Gehel: wdqs: re-enable kafka poller on wdqs public cluster [puppet] - 10https://gerrit.wikimedia.org/r/467331 (https://phabricator.wikimedia.org/T206423) [15:54:31] (03Abandoned) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [15:55:16] (03Restored) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [15:55:24] (03PS12) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 [15:55:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) 05Open>03Resolved Battery replaced by Chris - thank you!: ``` Battery/Capacitor Count: 1 Battery/Capacitor Status: OK ``` [15:56:14] (03CR) 10jerkins-bot: [V: 04-1] servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [15:58:07] (03PS4) 10Krinkle: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) [15:58:29] (03PS13) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 [15:58:56] (03CR) 10Paladox: servermon: Add gunicorn.service systemd script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [15:59:39] (03Abandoned) 10Paladox: phabricator: Replace mod_php with php-fpm [puppet] - 
10https://gerrit.wikimedia.org/r/407958 (https://phabricator.wikimedia.org/T182832) (owner: 10Paladox) [15:59:57] (03Abandoned) 10Paladox: Gerrit: Fix log4j rotating files [puppet] - 10https://gerrit.wikimedia.org/r/434605 (owner: 10Paladox) [16:00:50] (03Abandoned) 10Paladox: Copy wikimedia-polygerrit-style.html to static/gerrit-theme.html [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439889 (owner: 10Paladox) [16:00:56] (03Abandoned) 10Paladox: Copy GerritSite.css and GerritSiteHeader.html from puppet repo [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890 (owner: 10Paladox) [16:02:45] (03PS63) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [16:06:38] (03CR) 10Alex Monk: [C: 04-1] "Per Brandon on IRC, they should both be active, so let's remove the active setting" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:07:48] 10Operations, 10netops, 10Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (10ayounsi) No real update since a year ago. All switch stacks have been upgraded to a version that doesn't have this specific bug (14.1X... 
[16:08:05] (03PS5) 10Krinkle: errorpages: Use service discovery for statsd in hhvm-fatal-error.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467239 (https://phabricator.wikimedia.org/T206963) [16:08:27] (03PS64) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [16:09:42] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10faidon) [16:10:39] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) {F26610134} All mcrouters now use 5 persistent conns to each shard, the above graph... [16:10:56] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) >>! In T196477#4661422, @MoritzMuehlenhoff wrote: > Linux 4.9.130-1 (which also contains the backport of the H840 Perc controller I made) has now been uploaded to "st... 
[16:12:40] (03PS1) 10Urbanecm: Add new throttle rule for WMCL Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467408 (https://phabricator.wikimedia.org/T206914) [16:12:42] (03PS1) 10Urbanecm: Add throttle rule for editathon at University of North Carolina at Charlotte [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) [16:14:55] 10Operations, 10Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10BBlack) p:05Triage>03Normal [16:15:28] 10Operations, 10Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10BBlack) [16:15:31] 10Operations, 10Traffic, 10Patch-For-Review: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (10BBlack) [16:15:52] (03CR) 10jerkins-bot: [V: 04-1] Add throttle rule for editathon at University of North Carolina at Charlotte [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [16:17:20] (03PS7) 10Urbanecm: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) [16:17:52] 10Operations, 10Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10BBlack) [16:18:10] <_joe_> !log restart prometheus-mcrouter-exporter.service across the fleet [16:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:13] 10Operations, 10Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10BBlack) [16:18:18] 10Operations, 10Certcentral, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10BBlack) [16:19:48] PROBLEM - MediaWiki memcached error 
rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:21:29] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467409 (https://phabricator.wikimedia.org/T207043) (owner: 10Urbanecm) [16:21:43] Another false alarm. From mcrouter failures for 'WANCache:t:commonswiki:gadgets-definition' keys. [16:21:47] The spike has subsided again. [16:21:52] T203786 [16:21:53] T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [16:21:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:22:11] yeah, the peak of conn_yields seems lower than usual [16:22:18] https://grafana.wikimedia.org/dashboard/db/memcache?panelId=38&fullscreen&orgId=1 [16:22:48] now mcrouter runs with 5 proxied conns [16:24:15] Krinkle: did you see https://github.com/facebook/mcrouter/issues/271#issuecomment-429277656 ? [16:30:59] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:32:08] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [16:34:15] 10Operations, 10Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Krenair) So, in scope: apt.wikimedia.org archiva.wikimedia.org dumps.wikimedia.org librenms.wikimedia.org lists.wikimedia.org mirrors.wikimedia.org netbox.wikimedia.org... 
[16:34:53] (03PS3) 10Muehlenhoff: mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [16:36:17] !log replacing pem0 on asw2-a7-eqiad T206972 [16:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:25] T206972: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 [16:38:27] (03PS3) 10Smalyshev: Enable tracking lexemes in Updater [puppet] - 10https://gerrit.wikimedia.org/r/467097 [16:39:12] (03CR) 10jerkins-bot: [V: 04-1] Enable tracking lexemes in Updater [puppet] - 10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev) [16:39:25] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Refactor kafka_config.rb and and kafka_cluster_name.rb in puppet to avoid explicit hiera calls - https://phabricator.wikimedia.org/T177927 (10mforns) p:05Normal>03Low [16:39:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:39:56] lovely [16:40:24] (03PS1) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) [16:41:10] (03PS4) 10Smalyshev: Enable tracking lexemes in Updater [puppet] - 10https://gerrit.wikimedia.org/r/467097 [16:41:15] (03CR) 10jerkins-bot: [V: 04-1] Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) (owner: 10Urbanecm) [16:41:58] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (10Cmjohnson) swapped it with one from a spare switch....leaving ticket open to enter RMA details [16:42:29] 10Operations, 10ops-codfw, 10decommission, 
10fundraising-tech-ops, 10Patch-For-Review: decom rigel.frack.codfw.wmnet - https://phabricator.wikimedia.org/T202535 (10Jgreen) [16:42:43] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T206651 (10Cmjohnson) 05Open>03declined yes, we know! still working on it...there is a current task already. [16:44:01] 10Operations, 10Core Platform Team Backlog (Watching / External), 10Readers-Web-Backlog (Tracking), 10Services (watching): Create Debian packages for Node.js 10 upgrade - https://phabricator.wikimedia.org/T203239 (10mobrovac) [16:45:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:47:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) [16:48:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) [16:49:01] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Cmjohnson) Dell sent me a 10G NIC and not a raid card. They are rushing one out. [16:49:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) This is now unblocked! Promethium can be decommed at any time. [16:52:38] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10BBlack) Yes, let's look at this today. I think we need better `tg3` ethernet card support in `inter... 
[16:52:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) Note that everything is weird about this host. It's on a cloud VM network, and isn't monitored by icinga, and is managed by the cloud puppetmaster. So probably most of the... [16:53:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) a:05ssastry>03None [16:54:05] (03PS1) 10Urbanecm: Fix a typo in wgLogoHD (mapwiki => napwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467413 (https://phabricator.wikimedia.org/T207056) [16:54:13] !log Start replication on db1087 and db1092 to avoid them lagging behind the whole night (nothing running there at this time) [16:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:39] (03PS2) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) [16:55:48] (03CR) 10jerkins-bot: [V: 04-1] Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) (owner: 10Urbanecm) [16:57:26] (03PS1) 10Urbanecm: Remove techcomwiki's row in wgLogo, there's no techcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467414 [16:57:44] (03PS3) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) [16:58:05] (03PS2) 10Urbanecm: Remove techcomwiki's row in wgLogo, there's no techcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467414 (https://phabricator.wikimedia.org/T207056) [16:58:56] (03CR) 10jerkins-bot: [V: 04-1] Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 
(https://phabricator.wikimedia.org/T207053) (owner: 10Urbanecm) [16:59:55] (03PS1) 10Smalyshev: Fix lexeme error msgs [puppet] - 10https://gerrit.wikimedia.org/r/467415 (https://phabricator.wikimedia.org/T207030) [17:00:05] gehel: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1700). [17:00:52] (03CR) 10Smalyshev: [C: 031] wdqs: re-enable kafka poller on wdqs public cluster [puppet] - 10https://gerrit.wikimedia.org/r/467331 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel) [17:01:21] (03PS4) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) [17:02:52] (03CR) 10Ottomata: Add druid_load jobs to analytics refinery (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:03:09] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Dzahn) approved in SRE meeting [17:03:12] (03PS3) 10Dzahn: admin: add turnilo and superset sudo privs to analytics-admin group [puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: 10Herron) [17:03:16] (03PS3) 10Urbanecm: Remove techcomwiki's row in wgLogo, techcomwiki doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467414 (https://phabricator.wikimedia.org/T207056) [17:03:29] (03PS5) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) [17:04:04] (03CR) 10Dzahn: [C: 032] "approved in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: 
10Herron) [17:06:01] jouncebot, refresh [17:06:02] I refreshed my knowledge about deployments. [17:07:18] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 48.7 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:09:09] (03CR) 10Mforns: Add druid_load jobs to analytics refinery (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:11:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 77.96 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:13:29] (03CR) 10ArielGlenn: "Nice find! I'm happy to merge it through whenever you like." [puppet] - 10https://gerrit.wikimedia.org/r/467415 (https://phabricator.wikimedia.org/T207030) (owner: 10Smalyshev) [17:14:26] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Smalyshev) @Andrew the specs should be as close to production as we can get, [[ https://wikitech.wikimedia.... 
[17:15:06] (03CR) 10Smalyshev: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/467415 (https://phabricator.wikimedia.org/T207030) (owner: 10Smalyshev) [17:16:56] (03PS6) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) [17:17:16] (03CR) 10ArielGlenn: [C: 032] Fix lexeme error msgs [puppet] - 10https://gerrit.wikimedia.org/r/467415 (https://phabricator.wikimedia.org/T207030) (owner: 10Smalyshev) [17:17:25] (03PS2) 10ArielGlenn: Fix lexeme error msgs [puppet] - 10https://gerrit.wikimedia.org/r/467415 (https://phabricator.wikimedia.org/T207030) (owner: 10Smalyshev) [17:18:09] RECOVERY - MariaDB Slave Lag: s8 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [17:18:15] (03PS5) 10Gehel: Enable tracking lexemes in Updater [puppet] - 10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev) [17:18:56] (03PS1) 10Andrew Bogott: Neutron: default floating IP quota of 0 for new projects [puppet] - 10https://gerrit.wikimedia.org/r/467418 (https://phabricator.wikimedia.org/T206491) [17:20:22] (03CR) 10Gehel: [C: 032] Enable tracking lexemes in Updater [puppet] - 10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev) [17:20:56] (03PS6) 10Gehel: Enable tracking lexemes in Updater [puppet] - 10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev) [17:21:24] (03PS2) 10Andrew Bogott: Neutron: default floating IP quota of 0 for new projects [puppet] - 10https://gerrit.wikimedia.org/r/467418 (https://phabricator.wikimedia.org/T206491) [17:23:12] (03CR) 10Andrew Bogott: [C: 032] Neutron: default floating IP quota of 0 for new projects [puppet] - 10https://gerrit.wikimedia.org/r/467418 (https://phabricator.wikimedia.org/T206491) (owner: 10Andrew Bogott) [17:24:03] damn, andrewbogott is stealing my place in the puppet queue :) [17:24:13] (03PS7) 10Gehel: Enable tracking lexemes in Updater [puppet] - 
10https://gerrit.wikimedia.org/r/467097 (owner: 10Smalyshev) [17:24:13] sorry! I'm done for now :) [17:24:30] andrewbogott: np, but thanks! [17:25:30] (03PS3) 10Gehel: wdqs: re-enable kafka poller on wdqs public cluster [puppet] - 10https://gerrit.wikimedia.org/r/467331 (https://phabricator.wikimedia.org/T206423) [17:26:12] (03CR) 10Gehel: [C: 032] wdqs: re-enable kafka poller on wdqs public cluster [puppet] - 10https://gerrit.wikimedia.org/r/467331 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel) [17:26:49] (03PS4) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [17:28:42] (03PS1) 10Urbanecm: Test if all logos belongs to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 [17:28:50] (03CR) 10Ottomata: Add druid_load jobs to analytics refinery (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:29:38] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belongs to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (owner: 10Urbanecm) [17:30:18] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Andrew) How long would you need to test? I can allocate this space in a VM for a little while but most exi... 
[17:30:45] RECOVERY - MariaDB Slave Lag: s8 on db1092 is OK: OK slave_sql_lag Replication lag: 2.09 seconds [17:31:09] (03PS2) 10Urbanecm: Test if all logos belongs to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [17:31:43] (03PS1) 10Gehel: wdqs: auto restart wdqs-updater on config changes [puppet] - 10https://gerrit.wikimedia.org/r/467420 [17:32:07] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belongs to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [17:32:32] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) Hey @cwdent , could you help me with the SSH key? I have my yubikey handy. Thx! [17:33:01] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Nuria) 05Open>03Resolved [17:33:19] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 re [17:33:19] ed body (AttributeError: NoneType object has no attribute get) [17:33:34] (03PS3) 10Urbanecm: Test if all logos belongs to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [17:34:06] (03PS7) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) 
[17:34:22] (03CR) 10Krinkle: Test if all logos belongs to existing wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [17:34:25] (03PS4) 10Urbanecm: Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [17:34:29] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [17:34:30] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [17:34:32] (03CR) 10Krinkle: [C: 031] Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [17:34:37] (03CR) 10Krinkle: [C: 031] Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) (owner: 10Urbanecm) [17:35:08] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: reports.frdev.wm.o -- still in use? - https://phabricator.wikimedia.org/T170640 (10Jgreen) 05Open>03Resolved This site is removed. I left most of the puppet code intact so we can, in theory, bring it back up quickly if we decide to do so. 
[17:35:49] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [17:36:17] (03PS5) 10Urbanecm: Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [17:37:14] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [17:37:20] (03PS6) 10Urbanecm: Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [17:38:17] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [17:39:50] (03PS1) 10Urbanecm: Delete HD logos for non-existing project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467421 (https://phabricator.wikimedia.org/T207066) [17:40:51] (03PS2) 10Urbanecm: cswikivoyage has HD logo even the project doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467421 (https://phabricator.wikimedia.org/T207066) [17:41:10] (03PS5) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [17:42:12] (03CR) 10Mforns: "OK, now please hold on merging until refinery-source changes have been deployed. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:46:34] (03PS1) 10Urbanecm: Add viwikimedia to DNS [dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) [17:47:36] (03PS6) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [17:52:39] (03PS1) 10Ladsgroup: ores: puppet config for redis task tracker [puppet] - 10https://gerrit.wikimedia.org/r/467428 (https://phabricator.wikimedia.org/T152012) [17:52:55] 10Operations, 10netops, 10Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 (10faidon) I just looked briefly at T172459 and it looks like the last update there was to attempt this during the switchover period whic... [17:54:03] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Dzahn) 05Open>03Resolved a:03Dzahn This should resolve the ticket. Please reopen if something doesn't work. 
[17:55:27] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@ff3bf90]: Test deployment - GUI update and new Updater build(wdqs1009) [17:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:57:28] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) @jkim_wikimedia these are the basic instructions for making an ssh key: https://wikitech.wikimedia.org/wiki... [17:57:31] 10Operations, 10DNS, 10Traffic, 10User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (10Urbanecm) [17:57:37] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@ff3bf90]: Test deployment - GUI update and new Updater build(wdqs1009) (duration: 02m 10s) [17:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:22] (03PS1) 10Urbanecm: Add ve.m.wikimedia.org to DNS [dns] - 10https://gerrit.wikimedia.org/r/467429 (https://phabricator.wikimedia.org/T207069) [17:58:59] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (10Urbanecm) p:05Triage>03Low [17:59:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:59:25] (03PS1) 10Addshore: wgLexemeEnableSenses true for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467430 (https://phabricator.wikimedia.org/T203887) [17:59:26] !log onimisionipe@deploy1001 Started 
deploy [wdqs/wdqs@ff3bf90]: Test deployment - GUI update and new Updater build(wdqs1009) [17:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1800) [18:00:04] addshore, Urbanecm, tgr, and Amir1: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:19] here [18:00:26] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (10Dzahn) a:05Urbanecm>03Dzahn [18:00:57] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (10Dzahn) is MobileFrontend enabled? [18:01:05] o/ [18:01:13] \o [18:01:17] o/ [18:01:36] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@ff3bf90]: Test deployment - GUI update and new Updater build(wdqs1009) (duration: 02m 11s) [18:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:43] Who wants to do what then? :D [18:01:52] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (10Urbanecm) Yes, per https://wikimedia.org.ve/wiki/Especial:Versi%C3%B3n. 
[18:01:57] I can start with mine, but I won't be able to hang around and deploy everything [18:02:07] I don't have deploy privs [18:02:42] (03CR) 10Addshore: [C: 032] Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [18:02:48] (03CR) 10Dzahn: [C: 032] Add ve.m.wikimedia.org to DNS [dns] - 10https://gerrit.wikimedia.org/r/467429 (https://phabricator.wikimedia.org/T207069) (owner: 10Urbanecm) [18:03:33] I can do the rest [18:03:44] (03PS5) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [18:03:48] (03CR) 10Addshore: [C: 032] Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [18:04:14] (03PS3) 10Addshore: Enable WBQualityConstraintsSuggestionsBetaFeature on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467343 (https://phabricator.wikimedia.org/T207019) [18:04:23] tgr: great thanks! [18:04:53] (03Merged) 10jenkins-bot: Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [18:06:52] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (10Dzahn) ``` host ve.m.wikimedia.org ve.m.wikimedia.org has address 208.80.154.224 ve.m.wikimedia.org has IPv6 address 2620:0:861:ed1a::1 ``` htt... 
[18:07:04] Urbanecm: http://ve.m.wikimedia.org/ exists now and redirects me (not on mobile) [18:08:50] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Smalyshev) > How long would you need to test? Well, it depends. Generally we have a number of tests that I... [18:09:51] thanks mutante [18:10:05] !log addshore@deploy1001 Synchronized wmf-config/Wikibase-production.php: SWAT: T207019 Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki (duration: 00m 49s) [18:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:08] T207019: Deploy Beta feature for constraint suggestions - https://phabricator.wikimedia.org/T207019 [18:10:26] (03PS2) 10Addshore: wgLexemeEnableSenses true for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467430 (https://phabricator.wikimedia.org/T203887) [18:10:32] (03CR) 10Addshore: [C: 032] wgLexemeEnableSenses true for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467430 (https://phabricator.wikimedia.org/T203887) (owner: 10Addshore) [18:10:59] welcome! resolved then? [18:11:09] (03PS3) 10Addshore: wgLexemeEnableSenses true for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467430 (https://phabricator.wikimedia.org/T203887) [18:11:13] (03CR) 10Addshore: [C: 032] wgLexemeEnableSenses true for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467430 (https://phabricator.wikimedia.org/T203887) (owner: 10Addshore) [18:11:44] no, invalid :D. We both oversaw one little "detail", the wiki is non-WMF and ve.wikimedia.org is redirect. 
Sorry, I should doublecheck before submitting a patch :D [18:12:22] (03Merged) 10jenkins-bot: wgLexemeEnableSenses true for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467430 (https://phabricator.wikimedia.org/T203887) (owner: 10Addshore) [18:12:39] Urbanecm: lol, ok. i thought something about the redirect looking uncommon [18:12:50] well.. then we should remove it [18:12:51] 10Operations, 10DBA, 10MediaWiki-Database, 10Performance-Team, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) The next step will be to stop replication of pc1004 from pc2004 and then run the following code in a screen in pc2004: ``` for TABLE in $(... [18:13:04] yes, I'm sorry [18:13:07] no problem [18:13:33] it's still good in a way.. either have both in DNS or none [18:13:36] better than mixed [18:13:49] (03PS1) 10Pmiazga: Beta: Show share button on mobile web for beta user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467437 (https://phabricator.wikimedia.org/T181195) [18:13:51] i took this because i rememberd the old ticket about some random ones missing .m. 
[18:14:43] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT enable senses on testwikidatawiki T203887 (duration: 00m 49s) [18:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:46] T203887: Turn on Sense support on test.wikidata.org - https://phabricator.wikimedia.org/T203887 [18:14:50] tgr: its all yours [18:15:32] (03CR) 10Pmiazga: [C: 032] Beta: Show share button on mobile web for beta user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467437 (https://phabricator.wikimedia.org/T181195) (owner: 10Pmiazga) [18:15:44] (03PS2) 10Gergő Tisza: Fix a typo in wgLogoHD (mapwiki => napwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467413 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:16:09] (03PS4) 10Gergő Tisza: Remove techcomwiki's row in wgLogo, techcomwiki doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467414 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:16:29] (03CR) 10jenkins-bot: Enable WBQualityConstraintsSuggestionsBetaFeature on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T207019) (owner: 10Jonas Kress (WMDE)) [18:16:31] (03CR) 10jenkins-bot: wgLexemeEnableSenses true for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467430 (https://phabricator.wikimedia.org/T203887) (owner: 10Addshore) [18:16:39] Urbanecm: I'm deploying these two together as I assume the techcomwiki one can't really be tested [18:16:40] (03Merged) 10jenkins-bot: Beta: Show share button on mobile web for beta user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467437 (https://phabricator.wikimedia.org/T181195) (owner: 10Pmiazga) [18:16:53] (03CR) 10jenkins-bot: Beta: Show share button on mobile web for beta user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467437 (https://phabricator.wikimedia.org/T181195) (owner: 10Pmiazga) [18:16:59] 
tgr, yes, you're right [18:17:01] (03CR) 10Gergő Tisza: [C: 032] Fix a typo in wgLogoHD (mapwiki => napwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467413 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:17:08] (03CR) 10Gergő Tisza: [C: 032] Remove techcomwiki's row in wgLogo, techcomwiki doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467414 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:18:09] (03Merged) 10jenkins-bot: Fix a typo in wgLogoHD (mapwiki => napwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467413 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:18:13] (03Merged) 10jenkins-bot: Remove techcomwiki's row in wgLogo, techcomwiki doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467414 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:24:37] looks like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/467437 was merged but not deployed [18:24:53] it's beta-only so I'll add it to the scap [18:25:42] Urbanecm: you can test on mwdebug1002 [18:27:50] tgr, seems to work [18:28:53] 10Operations: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Volans) [18:30:31] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:467437|Beta: Show share button on mobile web for beta user]] (no-op) (duration: 00m 49s) [18:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:13] I think for labs-only patch, rebase on deploy node is enough [18:31:21] but releng knows better [18:33:22] (03PS2) 10Gergő Tisza: Disable AICaptcha data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467139 (https://phabricator.wikimedia.org/T186244) [18:33:39] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@ff3bf90]: GUI updates and new Updater build [18:33:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:52] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467413|Fix a typo in wgLogoHD (mapwiki => napwiki) T207056]], [[gerrit:467414|Remove techcomwikis row in wgLogo, techcomwiki doesnt exist T207056]] (duration: 00m 48s) [18:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:56] T207056: Some logos added to wgLogoHD aren't uploaded to the server - https://phabricator.wikimedia.org/T207056 [18:34:02] (03CR) 10jenkins-bot: Fix a typo in wgLogoHD (mapwiki => napwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467413 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:34:04] (03CR) 10jenkins-bot: Remove techcomwiki's row in wgLogo, techcomwiki doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467414 (https://phabricator.wikimedia.org/T207056) (owner: 10Urbanecm) [18:34:40] Amir1: that'd work too. As long as I don't get confusing git diffs. [18:34:44] Urbanecm: deployed [18:34:48] thx [18:35:02] (03CR) 10Gergő Tisza: [C: 032] Disable AICaptcha data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467139 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [18:35:41] will you deploy the last patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/467421) tgr? 
[18:36:05] (03Merged) 10jenkins-bot: Disable AICaptcha data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467139 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [18:36:33] oops, sorry, need to reload the deployments page more often [18:36:40] will do that next [18:37:00] (03PS3) 10Gergő Tisza: cswikivoyage has HD logo even the project doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467421 (https://phabricator.wikimedia.org/T207066) (owner: 10Urbanecm) [18:39:20] thx [18:39:47] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467139|Disable AICaptcha data collection (T186244)]] (duration: 00m 49s) [18:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:51] T186244: Deploy AICaptcha data collection - https://phabricator.wikimedia.org/T186244 [18:39:59] (03CR) 10Gergő Tisza: [C: 032] cswikivoyage has HD logo even the project doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467421 (https://phabricator.wikimedia.org/T207066) (owner: 10Urbanecm) [18:41:05] (03Merged) 10jenkins-bot: cswikivoyage has HD logo even the project doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467421 (https://phabricator.wikimedia.org/T207066) (owner: 10Urbanecm) [18:41:25] (03PS2) 10Urbanecm: Add viwikimedia to DNS [dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) [18:42:19] Urbanecm: on mwdebug1002 [18:42:44] tgr, it works, please deploy [18:43:07] (03PS8) 10Urbanecm: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) [18:43:27] (03PS7) 10Urbanecm: Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [18:44:21] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belong to existing wiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [18:44:29] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467421|cswikivoyage has HD logo even the project doesnt exist (T207066)]] (duration: 00m 49s) [18:44:30] (03PS3) 10Gergő Tisza: Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) [18:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:32] T207066: cswikivoyage has HD logo even the project doesn't exist - https://phabricator.wikimedia.org/T207066 [18:45:26] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review, 10User-Urbanecm: Mobile DNS entry for vewikimedia is missing - https://phabricator.wikimedia.org/T207069 (10Urbanecm) 05Open>03Resolved [18:46:26] (03PS8) 10Urbanecm: Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [18:47:17] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [18:47:20] 10Operations: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Dzahn) [18:47:29] (03CR) 10Gergő Tisza: "Please leave a note to deployers if merge something to mediawiki-config or make sure it gets deployed (to mwdeploy1001 at least, the rest " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467437 (https://phabricator.wikimedia.org/T181195) (owner: 10Pmiazga) [18:47:36] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@ff3bf90]: GUI updates and new Updater build (duration: 13m 57s) [18:47:36] 10Operations: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Dzahn) subscribed to ops and ops-private [18:47:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:02] (03CR) 10Gergő Tisza: [C: 032] Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) (owner: 10Gergő Tisza) [18:49:04] (03Merged) 10jenkins-bot: Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) (owner: 10Gergő Tisza) [18:49:09] (03PS9) 10Urbanecm: Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) [18:49:43] (03CR) 10jenkins-bot: Disable AICaptcha data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467139 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [18:49:45] (03CR) 10jenkins-bot: cswikivoyage has HD logo even the project doesn't exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467421 (https://phabricator.wikimedia.org/T207066) (owner: 10Urbanecm) [18:49:47] (03CR) 10jenkins-bot: Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) (owner: 10Gergő Tisza) [18:49:49] (03PS2) 10Gergő Tisza: Enable reading from new backend of change_tag in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467315 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [18:49:59] (03CR) 10jerkins-bot: [V: 04-1] Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [18:50:52] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) Hey @cwdent, here's the public key: /Users/jkim/.ssh/id_rsa.pub What is the output of the yubikey?... 
[18:51:30] chaomodus: do you already know wikitech wiki ?:) [18:51:31] !log pulled gerrit 467315 to mwdeploy1001 (no-op, no scap needed) [18:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:51] (03CR) 10Gergő Tisza: [C: 032] Enable reading from new backend of change_tag in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467315 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [18:52:42] (03PS1) 10BBlack: interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 [18:52:55] (03Merged) 10jenkins-bot: Enable reading from new backend of change_tag in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467315 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup) [18:53:06] (03CR) 10jerkins-bot: [V: 04-1] interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (owner: 10BBlack) [18:53:30] (03PS2) 10BBlack: interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) [18:53:56] Amir1: patch is on mwdebug1002 [18:54:02] cool [18:54:05] (03CR) 10jerkins-bot: [V: 04-1] interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) (owner: 10BBlack) [18:54:40] (03PS3) 10BBlack: interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) [18:55:55] tgr: works fine, please proceed if logs are clean [18:57:16] 10Operations: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Dzahn) [18:57:58] 10Operations: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Dzahn) added to wmf and ops LDAP groups. 
adding to puppet repo as ldap_only admin (temp, until we have root shell added) to avoid alerts about unsynced LDAP <-> Puppet admin module [18:59:01] (03PS4) 10GTirloni: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - 10https://gerrit.wikimedia.org/r/466990 (owner: 10Bstorm) [18:59:14] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) @jkim_wikimedia If you plug it into your laptop and touch the button it will spit out some text. The first... [18:59:29] !log LDAP - added crusnov to wmf and ops groups [18:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:48] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467315|Enable reading from new backend of change_tag in s7 (T194164)]] (duration: 00m 49s) [18:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:51] T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164 [18:59:58] Amir1: deployed [19:00:02] (03PS1) 10Urbanecm: Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) [19:00:14] Thanks [19:00:22] I'm looking at logs [19:00:48] Dear SRE: if anything went wrong on s7, please revert my patch [19:01:11] (03CR) 10jerkins-bot: [V: 04-1] Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) (owner: 10Urbanecm) [19:02:13] (03CR) 10GTirloni: [C: 032] labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - 10https://gerrit.wikimedia.org/r/466990 (owner: 10Bstorm) [19:02:37] 10Operations: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Volans) [19:03:36] (03CR) 10Gehel: [C: 031] "It 
looks like the code does what it says it does. And it make sense to me. (But honestly, not really my area of expertise)." [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) (owner: 10BBlack) [19:04:02] (03PS1) 10Dzahn: admins: add Cas Rusnov to admins as ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/467447 (https://phabricator.wikimedia.org/T207009) [19:05:05] (03CR) 10Dzahn: [C: 032] admins: add Cas Rusnov to admins as ldap_only [puppet] - 10https://gerrit.wikimedia.org/r/467447 (https://phabricator.wikimedia.org/T207009) (owner: 10Dzahn) [19:06:16] mutante: ahahah ^ lol [19:07:09] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Dzahn) [19:07:57] volans: yea, i just want to avoid the warning mail and dont have the time to get to root shell right away and still needs a key [19:08:23] yeah I want to go over that patch with him to see how to do that [19:08:29] that's fine as is, thanks for the help [19:08:29] i did LDAP but if i don't keep it in sync with puppet that's a no-no [19:08:36] so this seemed easy enough [19:08:44] which part of LDAP? [19:08:48] wmf and ops [19:08:52] mmmh [19:08:59] that actually requires the meeting [19:08:59] what other part ? [19:09:06] doesn't? [19:09:14] no, just the root shell does [19:09:19] imho [19:09:40] yeah but now if we add the ssh user it will be root from the start [19:09:40] (03CR) 10Framawiki: [C: 031] Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) (owner: 10Urbanecm) [19:09:46] right? 
[19:10:50] no, it doesn't have to be root [19:11:03] it would just move from the "ldap_only" section [19:11:04] ah right that's in data.yaml [19:11:10] to the shell user section [19:11:28] !log mforns@deploy1001 Started deploy [analytics/refinery@1fc53d9]: deploy refinery together with source version 0.0.78 [19:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:53] right, it could warn that ops group membership doesnt match if we dont [19:12:00] fixing it, actually, heh [19:12:21] (03PS1) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:12:49] volans: yea, fixed. only "wmf" now. that's the cleanest and icinga login (and other) works either way [19:13:01] ack [19:13:17] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [19:13:28] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10Dzahn) [19:15:18] he confirmed Icinga works :) and i gotta run for now, bbl [19:15:30] (03PS2) 10Urbanecm: Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) [19:16:20] (03CR) 10jerkins-bot: [V: 04-1] Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) (owner: 10Urbanecm) [19:16:32] (03Abandoned) 10Urbanecm: Test if all logos belong to existing wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467419 (https://phabricator.wikimedia.org/T207064) (owner: 10Urbanecm) [19:17:25] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) @cwdent ccccccgdrdri [19:18:16] 
(03PS2) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:19:01] (03PS3) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:20:03] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [19:20:11] (03PS3) 10Urbanecm: Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) [19:21:03] (03CR) 10jerkins-bot: [V: 04-1] Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) (owner: 10Urbanecm) [19:22:12] (03PS4) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:23:01] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [19:23:16] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) @jkim_wikimedia thanks! As far as the public key I need the actual contents of the file which you can see... [19:23:30] WMF sites loading really slowly for anyone else? 
[19:23:42] (03PS5) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:23:53] (03PS6) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (https://phabricator.wikimedia.org/T207074) [19:23:58] musikanimal: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [19:24:44] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (https://phabricator.wikimedia.org/T207074) (owner: 10Urbanecm) [19:25:29] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=99%) [19:25:33] (03PS7) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:26:20] ottomata: ^^^ an-coord1001 [19:26:25] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [19:27:25] !log mforns@deploy1001 Finished deploy [analytics/refinery@1fc53d9]: deploy refinery together with source version 0.0.78 (duration: 15m 56s) [19:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:39] RECOVERY - Disk space on an-coord1001 is OK: DISK OK [19:28:17] (03PS8) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:29:07] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [19:30:19] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - 
https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC+ItHLDwXZYoK8b3LEff1bM6UydLGFXMCprg+LVLkwDR4fQFSEMNMLAsdNoXn... [19:32:09] PROBLEM - High lag on wdqs1010 is CRITICAL: 2.069e+05 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:32:22] !log mforns@deploy1001 Started deploy [analytics/refinery@3f4adf8]: deploy refinery together with source version 0.0.78 without all removed old jars [19:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:10] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@ff3bf90]: Redeploy 1010 [19:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] legoktm: thanks. Appears to be just my ISP, but oddly only WMF sites. pinging I get some 30% packet loss. Maybe they know I spend too much time on the wiki! [19:33:38] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@ff3bf90]: Redeploy 1010 (duration: 00m 28s) [19:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:49] musikanimal: you're not the only one who uses your ISP :) filing a task with a traceroute will be helpful to netops [19:34:13] I don't want to post my IP though... can I scrub that out? [19:35:25] (03CR) 10Krinkle: [C: 031] "LGTM. 
Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467345 (owner: 10Addshore) [19:36:00] okay nvm, other things aren't working, so not just WMF sites [19:37:41] !log mforns@deploy1001 Finished deploy [analytics/refinery@3f4adf8]: deploy refinery together with source version 0.0.78 without all removed old jars (duration: 05m 18s) [19:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:22] (03PS9) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:44:12] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [19:49:42] (03PS1) 10Urbanecm: Fix typo in IS.php: use ltwiki instead of ltwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467453 (https://phabricator.wikimedia.org/T207081) [19:49:57] (03PS10) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [19:50:29] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:51:02] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [19:55:57] (03CR) 10Mforns: "The last version of refinery-source with since/until parameters (and refinery) has been deployed. 
So, I think this can be continued to be " [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [19:59:52] 10Operations, 10Performance-Team, 10monitoring, 10Availability, 10Patch-For-Review: Perform a statsd and Graphite switch - https://phabricator.wikimedia.org/T206963 (10Krinkle) [20:00:05] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T2000). [20:00:16] 10Operations, 10monitoring, 10Availability, 10Patch-For-Review, 10Performance-Team (Radar): Perform a statsd and Graphite switch - https://phabricator.wikimedia.org/T206963 (10Imarlier) [20:01:19] 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018, and 3 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Imarlier) [20:01:49] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:02:53] onimisionipe: you around? [20:03:27] volans: yes [20:03:35] But on mobile [20:03:43] wdqs-updater has failed on wdqs1010, should be just restarted? I saw also an alarm for high lag on the same host [20:04:26] Oh...ok. wait. Let me alert Stas first [20:04:58] ack, SMalyshev ^^^ :) [20:04:58] Its a test node. Nothing live is hitting it [20:05:04] no, leave it alone for now. 
It's lagging badly for some reason, and I am trying to figure out what's up [20:05:26] ack [20:05:27] something is messed up in there, it has no traffic but huge lag [20:05:28] Oh...ok [20:06:10] We should probably disable notification [20:08:38] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:10:15] volans: thanks for the ping! [20:10:24] np :) [20:10:33] let me know if I can help [20:11:19] Sure [20:13:08] (03PS4) 10BBlack: interface::rps: support tg3 properly [puppet] - 10https://gerrit.wikimedia.org/r/467443 (https://phabricator.wikimedia.org/T206105) [20:13:10] (03PS1) 10BBlack: interface::rps: always be NUMA aware [puppet] - 10https://gerrit.wikimedia.org/r/467469 [20:15:59] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [20:17:39] mobrovac, _joe_, you around? /cc arlolra [20:17:47] (03PS1) 10Urbanecm: Use testwikidatawiki instead of testwikidata in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) [20:18:55] looks like after the data-center switch ... all parsoid traffic is now in eqiad (and before that everyting in codfw after the first switch) .. before that ... you had a different setup where background jobs would go to codfw (secondary) and everything else would go to eqiad (primary). [20:19:13] (03PS2) 10Urbanecm: Use testwikidatawiki instead of testwikidata in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) [20:19:19] it is not a problem since it all works as expected, but wondering if this was an intentional change. 
[20:19:20] (03PS11) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [20:19:39] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:19:53] arlolra noticed that during his usual pre-deploy checks and we were wondering what / why it changed. [20:20:23] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [20:20:36] !log arlolra@deploy1001 Started deploy [parsoid/deploy@b758124]: Updating Parsoid to 8f3ff40 [20:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:33] (03PS12) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [20:21:47] (03PS13) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [20:21:49] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:22:08] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [20:22:45] (03PS2) 10Urbanecm: Fix typo in IS.php: use ltwiki instead of ltwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467453 (https://phabricator.wikimedia.org/T207081) [20:22:53] (03PS3) 10Urbanecm: Use testwikidatawiki 
instead of testwikidata in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) [20:23:02] 10Operations, 10SRE-Access-Requests: Requesting deplyoment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10aaron) [20:23:02] (03PS14) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [20:23:18] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [20:24:26] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [20:25:19] (03PS15) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [20:26:42] (03CR) 10jerkins-bot: [V: 04-1] [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (owner: 10Urbanecm) [20:27:01] (03PS16) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 [20:28:47] 10Operations, 10Parsoid, 10Datacenter-Switchover-2018: Parsoid no longer active-active - https://phabricator.wikimedia.org/T207091 (10Arlolra) [20:28:48] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test [20:28:48] read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) [20:30:59] RECOVERY - 
mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:32:19] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@b758124]: Updating Parsoid to 8f3ff40 (duration: 11m 43s) [20:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:05] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@834d00a]: Update mobileapps to c2a4ef9 (T206701 T206467 T168875) [20:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:11] T206701: MCS must not pass through `vary: accept` from parsoid - https://phabricator.wikimedia.org/T206701 [20:34:12] T168875: Remove reference sections at end of article from page content HTML - https://phabricator.wikimedia.org/T168875 [20:34:12] T206467: Provide all reference lists - https://phabricator.wikimedia.org/T206467 [20:34:21] 10Operations, 10Parsoid, 10Datacenter-Switchover-2018: Parsoid no longer active-active - https://phabricator.wikimedia.org/T207091 (10ssastry) [[https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-90d&to=now&cluster=parsoid&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-clus... 
[20:37:52] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@834d00a]: Update mobileapps to c2a4ef9 (T206701 T206467 T168875) (duration: 03m 47s) [20:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:20] (03PS7) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [20:52:25] !log Updated Parsoid to 8f3ff40 (T205642, T206003, T187848, T205455, T205743) [20:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:35] T205455: transformTests.js fails to validate QuoteTransformer on a bunch of wiki pages - https://phabricator.wikimedia.org/T205455 [20:52:35] T206003: Beta Cluster: Parsoid config request failures from the MediaWiki API - https://phabricator.wikimedia.org/T206003 [20:52:36] T187848: Fix token transformer return types - https://phabricator.wikimedia.org/T187848 [20:52:36] T205642: Error a.k.toLowerCase is not a function parsing extension token - https://phabricator.wikimedia.org/T205642 [20:52:36] T205743: parse.js genTest mode emits COMMENT instead of CommentTk - https://phabricator.wikimedia.org/T205743 [20:54:48] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 re [20:54:48] ed body (AttributeError: NoneType object has no attribute get) [20:55:17] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Andrew) Sorry, my question was poorly phrased. 
How long would you need to test a given VM box in order to... [20:55:59] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [20:57:08] 10Operations, 10monitoring: Graphite1001 disk usage at 96% - https://phabricator.wikimedia.org/T207040 (10colewhite) # ores appears to be capturing worker-specific metrics at `ores..uwsgi.worker..(...)` The field appears variable and unpredictable. Depending implementat... [21:00:04] bawolff and Reedy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T2100). [21:04:03] (03PS17) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (https://phabricator.wikimedia.org/T207074) [21:04:09] PROBLEM - High lag on wdqs1003 is CRITICAL: 3622 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:04:15] (03PS18) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (https://phabricator.wikimedia.org/T207074) [21:12:31] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Smalyshev) > How long would you need to test a given VM box in order to determine whether or not it's an ad... [21:22:42] hey, quick question, I merged the beta config change today, Gergo had a pretty comment about it https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467437/ [21:23:23] what is the protocol in this case? 
(When I need to push a config change to beta cluster), I know that beta cluster does the automatic update (post-merge hook) [21:23:51] Ideally you git pull it on the deploy server and then sync-file it in prod so things are consistent [21:24:01] do I go to deploy and pull the that patch? or can I just leave note in commit msg that this is no-op and it changes only betacluster [21:24:54] Reedy - that sounds good, but then why do we have the post-merge hook? [21:25:10] if I have to go to deploy server and deploy it anyway ? :) [21:25:19] because sync-file in prod doesn't do anything on beta [21:25:29] I'm ok with doing that, I'm just curious [21:26:40] Reedy - I see there is security window right now, are you deploying anything? [21:26:50] Nope, I don't believe bawolff is either [21:27:00] can I go and pull my patch and deploy it? [21:27:01] Nope, I'm not doing anything [21:27:16] I checked the git and my patch is not there [21:27:21] I'm not really here. I'm technically away this week [21:27:27] (on deploy serv) [21:28:33] yeah, fine by me [21:28:46] ok, I'm on it /cc bawolff [21:28:59] deploying the https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467437/ [21:32:47] Reedy, can you lend me a hand? [21:33:03] What do you need a hand with? [21:33:08] I fetched changes on deploy1001, did a git log but I don't see my patch o_O [21:33:27] (03PS1) 10Urbanecm: Add vn.wikimedia.org to wikimedia-chapter [puppet] - 10https://gerrit.wikimedia.org/r/467525 (https://phabricator.wikimedia.org/T207052) [21:33:40] pmiazga: Just git log by itself? 
[21:33:51] as that only shows the current changes in tree [21:33:59] git fetch doesn't merge the current changes in tree [21:34:02] I'm folowing https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Fetching_patches [21:34:22] `git fetch` and then `git log -p HEAD..@{u}` [21:34:42] Oh, I don't even know what @{u} means [21:34:57] this is what I do during SWAT windows [21:35:23] @{u} is a shortcut to refer to the upstream branch which the current branch is tracking [21:35:28] commit 7f22c1d566d6ca04cb63c34ee4c942ee8d83158a [21:35:29] Author: Piotr Miazga [21:35:29] Date: Mon Oct 15 20:12:54 2018 +0200 [21:35:29] Beta: Show share button on mobile web for beta user [21:35:29] [21:35:29] Bug: T181195 [21:35:30] T181195: Add a share button to the mobile site (currently in beta) - https://phabricator.wikimedia.org/T181195 [21:35:31] Change-Id: I8b093aa99853ce50453a2e7bb75dfd6719c69386 [21:35:36] It's there [21:35:40] Just not on the top [21:35:47] Like 7 commits down [21:35:54] Because I guess someone probably already pulled it [21:36:11] someone pulled it down and didn't freak out? [21:36:24] well, if someone knows what thy're doing... [21:36:32] A change to beta settings shouldn't cause a freak out [21:36:32] Krenair - it's beta change, prefixed with "Beta: ..." [21:36:33] yeah I hope they knew what they were doing. [21:36:44] 10Operations, 10SRE-Access-Requests: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10ArielGlenn) [21:36:52] I was told that only beta config changes have to be prefixed with Beta so no one freaks out when there are some extra patches [21:37:07] I'm not aware of any such requirement [21:37:15] 10Operations, 10SRE-Access-Requests: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10ArielGlenn) @lmarlier, can you sign off please? 
[21:37:26] Reedy, yeah, you're right, my bash window is small, I checked like 3-4 changes and I thought that it's not there
[21:37:45] there should never be extra patches
[21:38:25] yeah, I don't think Beta: prefixing so people ignore the extra patches is a well known practice.
[21:39:32] if someone merges to that repo and then doesn't deploy
[21:39:34] then someone else comes along
[21:39:37] person #1 should be reverted
[21:40:14] Krenair, yeah, I agree, but the thing is that the Beta config changes are deployed immediately by the post-merge hook
[21:40:27] not in prod they're not
[21:40:38] person #2 should never be in a position to have to make that distinction
[21:41:11] I agree, I was told wrong, I assumed that just +2 on patch with "Beta:" prefix is enough. I'll deploy those changes from now on
[21:41:43] ok, one more thing, what to do now if my patch is already there? do the scap-sync or just leave it as it is
[21:42:17] https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:20] 18:30 tgr@deploy1001: Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: Beta: Show share button on mobile web for beta user (no-op) (duration: 00m 49s)
[21:42:29] Looks like Gergo deployed it 4 hours ago?
[21:43:05] it makes sense now, Gergo added a comment on the patch I merged
[21:43:48] (CR) Pmiazga: [C: 2] "Gergo - yeah, thanks, I assumed that if the post-merge updates the beta cluster there are no other actions required." [mediawiki-config] - https://gerrit.wikimedia.org/r/467437 (https://phabricator.wikimedia.org/T181195) (owner: Pmiazga)
[21:45:22] (PS1) Urbanecm: Initial configuration for viwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052)
[21:48:41] (PS2) Urbanecm: Initial configuration for viwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052)
[21:49:53] (CR) Cwhite: [C: 1] "This looks good.
Let's get I9ed833306a25035a9c2af70c88754063adccbe9f merged first and test." [puppet] - https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[21:50:57] (CR) Cwhite: [C: 1] "> This looks good. Let's get I9ed833306a25035a9c2af70c88754063adccbe9f" [puppet] - https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[21:51:29] (PS1) BryanDavis: toolforge: expand CSP allowed host list [puppet] - https://gerrit.wikimedia.org/r/467532 (https://phabricator.wikimedia.org/T130748)
[21:58:39] (CR) Cwhite: [C: 1] "The scope of this change looks fairly wide, but the spot-checking I did looks like it's no-op. https://puppet-compiler.wmflabs.org/compil" [puppet] - https://gerrit.wikimedia.org/r/459660 (owner: Dzahn)
[21:58:55] (CR) Andrew Bogott: [C: 2] toolforge: expand CSP allowed host list [puppet] - https://gerrit.wikimedia.org/r/467532 (https://phabricator.wikimedia.org/T130748) (owner: BryanDavis)
[22:13:16] pmiazga: sorry, looks like I created more confusion
[22:13:27] I should have been clearer that it's already deployed
[22:13:48] tgr - nah, it's good, thx for the comment, I was sure that I can just +2 on a patch and forget about it
[22:14:11] thanks for letting me know that we should push those changes to prod. I was unaware of that
[22:14:53] my biggest failure was checking the git log..
my patch was there but pretty far in the log, I checked only first 4-5 entries and assumed that it is not there
[22:15:01] technically that would work for beta-only patches, but deployers are trained to freak out when some of the code to be deployed is not theirs
[22:15:09] this confused me a lot, but it's my mistake, sorry for that
[22:27:58] (CR) Cwhite: [C: 1] "Looks good, pending I54017a2e720d768c3d441b974ce070390c956253" [puppet] - https://gerrit.wikimedia.org/r/467015 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[22:30:59] PROBLEM - High lag on wdqs1010 is CRITICAL: 1.696e+05 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:45:02] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: delete t206636 VM and revert quota bumps for project wikidata-query - https://phabricator.wikimedia.org/T207101 (Andrew) p:Triage>High
[22:49:16] (CR) Cwhite: [C: -1] "I don't think this will have the desired effect. The root of the problem looks like the systemd timer is attempting to use a user that do" [puppet] - https://gerrit.wikimedia.org/r/467017 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[22:52:29] Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (cwdent) @jkim_wikimedia thanks, I now have enough info to make the accounts and will find time in the next day or two.
[22:54:59] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (Andrew) I've created a temporary VM for this: t206636.wikidata-query.eqiad.wmflabs -- it's on a new and cu...
[22:58:57] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Legoktm)
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:00:29] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get)
[23:01:39] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[23:21:24] (CR) GTirloni: [C: 1] "> Patch Set 5: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)